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ABSTRACT 

A  Self-Organizing  Daubase  System  (SODS)  moniiors  queries  asked,  finds  a  good  (or 
opiimal)  database  structure  for  those  queries,  and  suggests  or  does  the  reorganization. 
In  this  thesis  we  describe  a  prototype  SODS  for  single-file  relational  queries  and  give 
an  integrated  analysis  of  its  major  design  problems: 

-  estimation    of    the    number   of    records   satisfying   a   condition    (i.e..   condition 
selectivity) 

-  query  optimization 

-  storing  information  about  a  set  of  queries 

-  optimal  selection  of  secondary  indices 

We    present    new    results    for    each    of    those    problems.       Some    of    this    research    was 
implemented  in  FASTSCAN.  a  commercial  query  system. 

We  present  a  new  method  for  accurate  estimation  of  the  number  of  records 
satisfying  a  condition  fie/d  rel-op  constant,  where  rel-op  is  one  of  "=",  "<".  ">". 
"<",  ">".     We  also  examine  estimates  for  more  complicated  conditions. 

We  present  elementary  operations  (such  as  UNION,  INTERSECT)  on  pointer  and 
record  streams.  We  show  how  to-  use  the  query  parse  tree  to  construct  a  query 
evaluation  method  (EM)  from  those  operations.  Then  we  give  an  algorithm  for 
selecting  the  optimal  EM.  based  on  converting  the  query  to  conjunctive  normal  form. 

We  examine  ways  to  compress  information  about  a  set  of  queries  by  combining 
information  for  similar  queries.  We  derive  a  compression  scheme  which  allows  a 
correct  and  fast  computation  of  the  cost  of  the  average  query  under  any  index  set 

We  combine  all  previous  results  in  analyzing  the  NP-hard  problem  of  optimal  index 
selection.  We  present  two  algorithms  for  it.  The  first  one  always  finds  the  optimal 
answer  and  runs  fast  on  real-size  problems  despite  its  exponential  worst-case 
complexity.  The  second  one  (a  Greedy  method)  runs  much  faster,  yet  finds  the  optimal 
answer  very  frequently. 

We  analyze  the  Maximum  Cover  problem  (also  NP-hard).  a  simplification  of  the 
optimal  index  selection.  We  prove  that  the  Greedy  method  is  an  epsilon-approximate 
algorithm:     its  answer  is  always  more  than  63%  of  the  optimal  answer. 
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Chapter  1  -  4  -  Introduction 

1.  INTRODUCTION 

Big  Brother  Is  Watching  You. 
George  Orwell.  "1984". 

in    this    disserution    we    consider    a   Self-Organizing    Database    System    (SODS)    which 

monitors    single-file    relational    queries    and    finds    a    set    of    secondary    indices    that 

minimizes  the  average  query  cost.     In  the  process  we  solve  problems  in  areas  of  query 

optimization,  estimation  of  selectivity,  and  recording  query  frequencies.     We  also  present 

and  analyze  two  algorithms  for  optimal  index  selection. 

As  the  current  trend  of  explosive  information  growth  continues,  large  database  systems 
are  becoming  more  important  and  more  widespread.  .\s  the  technical  training  of 
daubase  users  declines,  so  rises  the  use  of  relational  databases.  In  the  environment  of 
a  large  relational  DBMS  one  of  the  most  acute  problems  is  the  slow  response  time  to 
interactive  queries.  The  psychological  research  indicates  [Embley  81]  that  if  a  user  has 
to  wait  for  the  system's  answer  for  a  period  longer  than  5-10  seconds,  his  attention 
will  drifL  If  Lhis  is  repeated  frequently,  the  user's  productivity,  ai  well  ab  his  uciire 
to  use  that  system,  will  decrease.  Thi's  queiy  op'imization  -  the  major  prerequisite  for 
a  fast  response  time  -  is  crucial  to  any  database  syrtem. 

Query  optimization  in  relational  DBMS  has  been  the  subject  of  much  work  -  for 
example:  [Astrahan  76.  Wong  76.  Liu  76.  Smith  75,  Hall  76.  Blasgen  77.  "j'ao 
79.  Selinger  79.  Putkonen  79,  Aho  79,  Kim  82],  Usually  the  problem  posed  is  this: 
"given  some  fixed  database  structure,  find  the  optimal  evaluation  method  for  a 
given  single  query".  However,  a  different  approach  is  possible  -  "given  some  fixed 
set  of  queries,  find  the  optimal  database  structure  to  evaluate  it".  This  approach 
is  more  general,  since  the  goal  is  to  minimize  evaluation  cost  not  only  on  each  single 
query,  but  also  on  the  set  of  all  queries. 

1.1   MOTIVATION  BEHIND  SELF-ORGANIZING  DATABASE 
SYSTEMS 

The  main  reason  for  this  approach  is  that  in  a  typical  large  database  user  needs  and 
data  parameters  usually  change  with  time,  reflecting  changes  in  external,  real-world 
requirements.  Thus  to  maintain  the  optimal  performance,  the  DBMS  needs  a  periodic 
restructuring.  This  task  is  the  responsibility  of  the  Database  .Administrator  (DB.A.).  '.vho 
usually  relies  on  educated  guesses  and  feedback  from  selected  users.  In  the 
environment  of  a  large,  complicated  database  system  this  is  no  longer  enough  -  a  more 
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analytic  and  reliable  method,  is  needed.  Experience  with  use  of  relational  systems  shows 
that  users  usually  fail  to  do  necessary  reorganizations  to  maintain  efficiency. 
Stonebraker,  for  example,  says  in   [Stonebraker  80] ,  p.  233: 

There  are  clear  instructions  concerning  how  the  system  catalogs  should  be 
physically  structured  (they  begin  as  heaps  and  should  be  hashed  when  their 
size  becomes  somewhat  suble).  Even  so,  some  users  fail  to  hash  them 
appropriately.  Of  course,  the  system  continues  to  run;  it  just  gets  slower  and 
slower.  We  have  finally  removed  this  particular  decision  from  the  user's 
domain  -  (it  is  done  automatically). 

The  SODS  should  continuously  monitor  the  queries.  When  it  senses  degradation  in 
performance  it  should  analyze  the  usage  and.  if  the  cost  of  reorganization  is  offset  by 
performance  improvement,  it  should  reorganize  itself  (or  at  least  suggest  that  to  the 
DBA).  The  application  programs  can  either  be  all  recompiled  at  once,  or  each  program 
can  be  recompiled  the  first  time  it  is  used  after  the  reorganization.  We  cannot  resist 
the  comparison  of  the  all-knowing  SODS,  which  always  watches  the  user,  with  the  Big 
Brother,  although  intentions  of  the  SODS  (we  hope)  are  more  benign. 

1.2  PREVIOUS  WORK 

About  half  of  the  previous  work  on  SODS  is  on  the  selection  of  secondary  indices. 
The  other  half  is  on  the  attribute  partitioning,  which  is  the  partitioning  of  a  relation 
into  separate  physical  files,  each  containing  a  subset  of  attributes. 

The  idea  of  an  SODS  was  first  suggested  by  Stocker  and  Dearnley  [Stocker 
73,  Dearnley  74]  who  discussed  mainly  dynamic  attribute  partitioning  and  attribute 
clustering  (where  an  attribute  may  exist  redundantly  in  several  clusters).  They  realized 
the    importance    of    cost    estimates    and    forecasting    of    the    pattern    of    use.       Aisberg 

[Alsberg  75]  suggested  that  an  SODS  do  not  only  attribute  (i.e..  vertical)  partitioning 
but  also  the  horizontal  partitioning  of  the  data  file.  Attribute  partitioning  of  records 
into  primary  and  secondary  segments  was  analyzed  by  [Eisner  76.  March  77.  March 
83]. 

Hammer  and  Chan  [Hammer  76.  Chan  76]  have  described  a  prototype  SODS  which 
was  limited  to  single-file  queries  which  were  either  conjunctions  or  disjunctions  of 
equalities.  They  considered  the  optimal  index  selection  and  presented  a  good  heuristic 
algorithm  for  it.  Hammer  and  Niamir  [Hammer  79.  Niamir  78]  while  restricting 
themselves  to  the  same  file  and  query  model,  turned  to  attribute  partitioning.  They 
tested  two  heuristic  algorithms  which  exhibited  good  average  performance. 
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An  area  somewhat  related  to  SODSs  is  file  reorganization  [Batory  82a] .  The  usual 
assumption  is  that  insertions  to  a  file  are  done  using  overflow  pages.  The  research 
question  is  choosing  file  reorganization  pomis. 

Unlike  the  previous  authors,  who  built  or  suggested  prototype  systems,  we  have 
designed  (and  partially  implemented)  our  SODS  at  the  heart  of  a  full-scale  commercial 
relational  DBMS.  \  component  of  lU  the  F.\STSC.-\N'  Query  System,  served  as  a  test 
vehicle  for  development  and  application  of  most  of  the  results  reported  here. 
FASTSCAN  IS  written  m  FORTRAN-77  and  runs  on  VAX-11  under  VMS.  It  is 
currently  used  by  several  large  financial  institutions.  The  files  it  handles  have  from  5 
10  500  fields  with  the  number  of  records  ranging  from  10  to  2.000,000. 

The  main  contribution   of   our  work   is  in  considering  the  design  of   the  SODS,   from 

query    optimization    to   database   reorganization,    as   a   whole  and   in    realistic   detail;    in 

giving  good  and  in  some  cases  optimal  solutions  to  each  design  subproblem:  and  in 
integrating  those  solutions  together. 

1.3  THE  SCOPE  OF  THIS  WORK 

The  foundation  for  our  work  is  the  relational  data  model,  which  provides  a  clean 
separation  between  the  logical  view  of  data  and  its  physical  organization,  opening  the 
door  to  the  proposed  database  reorganizations.  In  this  work  we  present  results  limited 
to  the  important  case  of  single-file  boolean  queries.  Such  a  query  specifies  a  boolean 
condition  on  field  values  from  the  same  record.  e.g.. 

AGE    <    25    AND   DEGREE    '    'Pf>D' 

John  Gashnig  in  (    [Pearl  83] .  p.  3)  is  quoted  as  saying. 

"The  8-puzzle  is  to  [analysis  of  search  algorithms]  as  the  fruit  fly  to  the 
geneticist;  simple,  convenient  to  manipulate,  yet  exhibiting  interesting 
phenomena  that  hypothetically  hold  for  a  broader  class  of  subjects." 

We  feel,  for  the  same  reasons,  that  the  single-file  boolean  query  is  the  fruit  fly  of 
query  optimization  research.  Many  of  the  ideas  we  have  discovered  for  the  single-file 
case  are  applicable  to  the  multifile  case.  Also,  multifile  query  optimization  is  usually 
reduced  to  single-file  query  optimization  [Wong  76].  so  the  latter  is  necessary  for  the 
former.  In  addition,  as  shown  by  [Whang  81].  if  join  methods  are  "separable",  then 
the  optimal  database  structure  can  be  derived  by  optimizing  the  structure  of  each  file 
independently. 
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We  also  limit  ourselves  to  the  context  of  off-line  update.  Thus  we  only  discuss 
queries,  i.e.,  retrievals,  and  not  updates,  deletions  or  insertions.  In  most  cases,  however, 
the  results  obtained  for  queries  are  applicable  to  other  types  of  transactions.  For 
example,  query  optimization  is  necessary  for  update  optimization,  since  an  update  is 
usually  a  query  followed  by  an  I/O. 

Of  the  several  possible  database  reorganizations,  we  have  chosen,  arguably,  the  most 
important  one  -  the  secondary  index  selection.  We  feel  that  selection  of  the  right  set 
of  indices  has  the  biggest  impact  on  query  optimization. 

1.4  MAIN  RESULTS  AND  THESIS  ORGANIZATION 
In  this  thesis  we  present  new  results  in  several  areas. 

In  chapter  2,  we  analyze  estimation  of  the  number  of  tuples  satisfying  a  condition. 
We  present  a  new  method  for  accurate  estimation  of  this  number  for  conditions  of 
the  type  field  rel-op  constant,  where  rel-op  is  one  of  "=",  "<",  "-".  "<".  ">".  Wc 
state  axioms  on  an  estimation  scheme.  Then  we  derive  two  schemes  satisfying  these 
axioms:  one  which  minimizes  the  worst-case  error,  and  another  which  reduces  the 
average  case  error.  We  also  analyze  extensions  of  our  scheme  to  more  complicated 
conditions. 

In  chapter  3,  we  present  a  model  of  query  evaluation.  We  begin  by  analyzing  basic 
operations  on  files  and  indices.  From  these  operations  we  build  basic  operations  on 
streams  of  records  and  pointers.  Then  we  define  a  query  evaluation  method  (EM)  as  a 
sequence  of  stream  operations  on  some  of  the  query  tree  nodes.  EM  cost  depends  on 
the  number  of  tuples  satisfying  conditions  of  those  nodes.  In  this  way  we  formalize 
the  space  of  possible  EMs  for  each  query  tree  and  define  the  optimal  E.M  as  the 
cheapest  EM  among  all  EMs  on  all  equivalent  queries. 

In  chapter  4  we  show  how  to  select  the  optimal  EM  in  our  model.  We  begin  by 
giving  an  algorithm  for  finding  the  optimal  EM  for  queries  in  conjunctive  normal  form 
(CNF).  Then  we  demonstrate  how  to  convert  every  query  to  CNF.  and  prove  that  for 
any  EM  on  any  query  there  is  an  equivalent,  and  possibly  cheaper.  EM  on  a  CNF  of 
that  query.  Therefore,  to  get  the  optimal  EM  we  first  convert  a  query  lo  CNF  and 
then  find  the  optimal  EM  on  the  CNF  query  tree.  We  also  derive  a  formula  for  the 
cost  of  the  optimal  EM. 
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In  chapter  5  we  use  that  formula  to  analyze  different  ways  of  storing  information 
about  a  set  of  queries.  We  determine  exactly  how  much  information  is  necessary  for 
correct  computation  of  the  average  query  cost  under  any  index  combination. 

In  chapter  6  we  attack  the  problem  of  the  optimal  index  selection  (O/S).  Several 
variations  of  OIS  have  been  proven  NP-hard  [Comer  78,  Piateisky  83,  Ip  83].  We 
present  two  algorithms  for  OIS.  The  first  one  is  a  backtracking  "branch  &  bound" 
algorithm  which  is  guaranteed  to  find  the  optimal  answer  and  whose  average  case 
performance  on  real-size  problems  is  very  good,  even  though  its  worst  case  complexity 
is  exponential. 

The  second  algorithm  is  a  "Greedy"  heuristic  nonbacktracking  search  algorithm  whose 
worst  case  complexity  is  0(A/'),  where  /V  is  the  number  of  indices.  While  it  is  no: 
guaranteed  to  always  find  the  optimal  answer,  it  usually  does  so.  One  reason  may  be 
that  it  is  guaranteed  to  perform  well  on  a  simplification  of  OIS,  called  the  Maximum 
Cover  Problem  (which  is  also  NP-hard).  We  prove  that  the  cost  of  the  Greedy 
algorithm's  answer  is  always  at  least  1  -  1/e  -  0.632...  of  the  optim.a!  answer  and  that 
1  -  1/e  is  the  exact  lower  bound. 

Finally,  in  chapter  7  we  summarize  this  work  and  indicate  the  directions  for  future 
research. 

1.5.  NOTATION 

-  We  use  /4,  B.  C.  and  D  for  field  (i.e..  attribute)  names. 

-  X,  Y  and  const  denote  consunts. 

-  rel-op  stands   for  one  of   the   following  relational  operators:   "=",   "<",   "<", 
">",  ">"  . 

-  £.   F,   and   C  are   used    for   boolean   expressions  on   fields  and   constants,   as 
defined  in  section  1.6. 

-  Tuple  IS  used  as  a  synonym  for  record. 

-  #/?  is  the  number  of  records  (or  tuples)  in  a  file. 

Additional  notation  will  be  introduced  as  needed.  A  full  list  of  notations  and 
abbreviations  is  given  in  appendix  II. 
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1.6  QUERY  MODEL 

In  this  section  we  define  our  query  model,   necessary  for  chapter  2.     Our  data  and 
cost  models  will  be  defined  in  chapter  3. 

We  consider  single-file  boolean  queries  of  the  following  form: 
File:  


Condition:  

Output  fields: 


where  File  is  a  file  (or  relation)  name.  Output   fields  is  a  list  of   fields  from   that 
file,  and  Condition  is  defined  by  the  following  semi-formal  BNF  grammar: 

Condition  ::=  BooleanExpr 

BooleanExpr  ::=  term  OR  BooleanExpr   [     term 

term  ::=  factor  AND  term     |     factor 

factor  ::=  c     |     NOT  c 

c  ::=  Comparison     |     (  BooleanExpr  ) 

Note:  &  is  a  synonym  for  AND; 
V  is  a  synonym  for  OR. 

Comparison  ::=  Expr  rel-op  Expr  | 

Expr  BETWEEN  Expr  AND  Expr   | 
Expr  EITHER  Expr,  Expr Expr" 

Noie:  Compared  expressions  must  be  of  the  same  type. 
either  numeric  or  character. 

Expr  ::=  ArithExpr   |   CharExpr 

where  ArithExpr  is  an  arithmetic  expression 

composed  of  numeric  fields  and  constants,  and  CharExpr 

is  either  a  character  constant  or  a  character  field. 

This  query  language,  a  subset  of  the  FASTSC.\N  query  language,  is  similar  to  those 
used  in  System  R   [Astrahan  76]   and  INGRES   [Stonebraker  76]. 

We  also  assume  that  the  same  field  name  is  not  used  twice  in  a  query.  This 
assumption  was  made  implicitly  in  most  of  the  previous  works  on  query  optimization. 
It  holds  for  the  great  majority  of  user  queries,  especially  since  our  query  language 
provides  convenient  operators  BETWEEN  and  EITHER,  allowing  us  to  replace 


"Eiprl    BETWEEN'    Exp'-    ^^D    tr^ri   mcjii-    Evrfl    *    Eipr-    ■^"^D   E'P''    1    i^r'i 
Eiprl    EITHER    E^.pr:.     -.    EnprK    mcjn-    E'.prl    i-    ;^iul    i.'  ji    Icj-.!   one   .>l    E\pr:.  E.prK 
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A  t  X     &■     A  <   Y 
with 

A  BETWEEN  X  &.  Y 

and 

A  =  X      V     A  =  X      V  ...  V  A  =  X 

1  :  K 

with 

A  EITHER  X  .  X X 

Other  "forbidden"  queries  can  be  rephrased  to  satisfy  this  assumption,  for  example: 

(A=10  V  B>0)  &.  (A=10  V  C<0)     ==>     A=10  L  (B>0  v  C<0) 
(A=10  &  B>0)  V  M=10  &  C<0)     ==>     A=10  V  (5>0  &  C<0) 

There  are  forbidden  queries  which  cannot  be  rephrased,  such  as 

(A=10  &  B>0)  V  {A=99  &.  C<0) 

{A=10  &.  B>Q)  V  (>4=10  &.  C<0)  V  (B>0  &.  C<0) 

Such  queries,  however,  are  infrequently  asked  by  users  and  will  be  ignored.' 


c-»iimaic<   *ould    be   bij^cj.    IcjJiiil   in   -onic   cj^c-    lo   ^.Iccinm   ol    *ub<'piiinjl    c^jlujti.'ti    ni.-:ti"J 
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2.  ESTIMATING  THE  NUMBER  OF  RECORDS 
SATISFYING  A  CONDITION 

Estimating  the  number  of  records  satisfying  a  condition  is  an  extremely  important  part 
of  query  optimization  since  the  cost  of  an  evaluation  method  depends  on  the  number 
of  records  or  pointers  retrieved  at  each  step.  In  this  chapter  we  will  analyze  such 
estimation  for  different  types  of  conditions.  Most  of  the  work  described  in  this 
chapter  was  also  reported  in    [Piatetsky  84] . 

Consider,  for  example,  the  query 

Find  the  number  of  records  for  whtch 
SALES  >  20  bin  ion   AND  YEAR  =    77 

on  file  Industrial  CompustaL"*     It  can  be  evaluated  by 

1.  a  sequential  scan 

2.  using  an  index  on  SALES  to  get  a  list  of  pointers  to  all  records  with  S.A.LES 
>  20  billion;  for  each  pointer  get  the  record  and  check  if  YEAR  =  77. 

3.  using  an  index  on  "^"EAR  to  get  a  list  of  pointers  to  all  records  with  \"EaR 
=  77;  for  each  pointer  get  the  record  and  check  if  SALES  >  20  billion. 

> 

4.  intersecting  pointer  lists  from  indices  on  SALES  and  on  YEAR. 

To  select  the  cheapest  method,  we  need  to  estimate  how  many  records  satisfy  SALES 
>  20  billion,  how  many  satisfy  YEAR  =  77,  and  how  many  satisfy  both  conditions. 
The  minimum  value  of  SALES"  in  the  file  is  0,  the  maximum  is  108  billion.  The  values 

of  YEAR  are  spread  almost  equally  among  20  values:   63.  64 82.      The  System   R 

optimizer  [Sehnger  79]  would  estimate  that  only  1/20  (5%)  of  all  records  satisfy  the 
condition  on  YEAR,  while  77/108  (71%)  of  all  records  satisfy  the  condition  on  SALES, 
and  would  choose  method  #3  (index  on  YEAR).  Query  evaluation  using  this  method 
took  137  seconds  of  elapsed  time.  41  seconds  of  CPU  time  and  2414  disc  I/O-s.  Using 
distribution  steps  and  the  estimation  formulas  presented  later  in  this  chapter. 
FASTSCAN  Query  Optimizer  guessed  that  only  2%  of  the  records  satisfy  the  condition 
on  SALES  (the  correct  number  is  0.5%)  and  chose  a  method  #4  (intersection  of 
indices).  Query  evaluation  by  this  method  took  3.6  seconds  of  elapsed  time  (38  times 
faster).  2.4  seconds  of  CPU  time  and  14  disc  I/O. 


Thi'    Ilk-    lia<-    jboui    ?0.000    rccoriJ>  -   20    >cjr-    ol    fiiunci.'il    iiilomi  iiion    I.t    jbi-ui    2.500   ci-nipiru-    irjJ.J     -n    Nc»->  .tL 
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Definition  2.1:  The  selectivity  of  a  logical  expression  £.  denoted 
SEUE).  is  the  number  of  records  satisfying  E.  divided  by  the  total  number 
of  records  in  the  file. 

Notation:  When  discussing  comparisons  on  a  single  field,  we  will  use  SEL(rel-op  X) 
to  denote  the  selectivity  of  the  comparison  field  rel-op  X.  For  example.  SEL(>3500) 
is  the  selectivity  of  field  >  3500. 

SEL-fE)  will  denote  the  estimate  of  SEL(£)  and  SEL-irel-op  X)  will  denote  the 
estimate  of  SEUrel-op  X). 

Estimation  of  the  number  of  records  satisfying  a  query  has  two  parts 

1.  Estimation  of  selectivity  of  comparisons. 

2.  Estimation  of  selectivity  of  logical  expressions  on  comparisons. 

We  devote  the  bulk  of  this  chapter  to  the  first  part  and  develop  a  new  and  detailed 
theory  for  estimatmg  selectivity  of  comparisons,  i.e..  conditions  of  the  lype 

field  =  X 

field  >  X  field  >  X  * 

field  <  X  field  <   X 

Having  selectivity  estimates  for  conditions  of  this  type,  we  can  then  estimate 
selectivity  of  comparisons  which  use  BETWEEN  and  EITHER  operators  by: 

SEL-(>^  BETWEEN  X  &  Y)  =  SEL~( A<y)  -SEL-(A<X) 
SEL~(A  EITHER  X X  )  =     I     SEL~(A=X  ) 

1  K  1 

i=l.K 

We  begin  by  analyzing  how  to  store  information  about  field  values  distribution.  One 
possible  scheme  is  a  histogram:  divide  the  range  of  field  values  into  K  buckets  and 
count  the  number  of  records  falling  into  each  buckeL  We  show  that  this  scheme  is 
not  very  good  for  estimating  selectivity,  since  it  frequently  leads  to  estimates  not  much 
better  than  those  obtained  by  random  guessing.  We  argue  that  this  happens  because 
traditional  histograms  control  the  wrong  parameter  -  they  force  buckets  to  have  equal 
width.     Instead,  we  should  have  buckets  of  equal  height. 

This  leads  us  to  the  second  scheme,  called  distribution  steps.  In  this  scheme  we 
collect  the  values  for  a  given  field  from  all  the  records  in  a  file  and  sort  them 
according  to  the  intrinsic  ordering  of  the  domain.     After  choosing  the  number  of  steps 
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S.  we  find  the  S  values  for  that  field  such  that  the  number  of  records  between 
successive  steps  is  the  same.  The  maximum  selectivity  estimation  error  from  distribution 
steps  is  approximately  1/S.     Thus,  it  can  be  arbitrarily  reduced  by  increasmg  S. 

Next,  we  analyze  conditions  that  are  desirable  (in  general)  on  selectivity  estimates  and 
give  the  axioms  of  consistency,  correctness  on  ends  and  monotonicity. 

We  then  examine  selectivity  estimates  based  on  our  distribution  steps  and  give  two  sets 
of  formulas  for  getting  these  estimates  from  the  distribution  steps.  The  first  minimizes 
the  worst-case  estimation  error.  The  second,  used  in  the  FASTSCAN  query  system,  has 
a  much  smaller  average  error  in  important  cases,  at  the  cost  of  a  slightly  increased 
worst-case  error. 

Then,  we  show  how  to  construct  a  close  approximation  of  true  distribution  steps  using 
sampling,  and  examine  the  relationship  between  sample  size  and  the  selectivity  estimation 
errors. 

Lastly,    we   give   the   axioms   for   selectivity    of   £  v  F,    £  <&  F.    not  E.      Following 

[Selinger     79] .     we     give     formulas     for     those     selectivities     under  assumption     of 

independence  between  £  and  F.     We  also  analyze   the  situation   when  £  and  F  have 

common  subexpressions. 

2.1  REVIEW  OF  THE  PREVIOUS  WORK 

One  of  the  first  attempts  at  estimatihg  comparison  selectivity  was  done  by  the 
designers  of  System  R  [Selinger  79] .  Their  scheme  relies  on  the  minimum  and 
maximum  field  values  (denoted  minval  and  maxval).  and  the  number  of  different 
field  values  (denoted  ndiff).     Their  estimates  are: 

SEL-(=X)  =  1  /  ndiff 

X  -  minval 
SEL-(<X)  = 


maxval  -  minval 
maxval  -  X 


SEL-(>X)  = 


maxval  -  minval 
SEL-(fX)  =  SEL-(<X) 
SEL-(>X)  =  SEL-(>X) 
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These  formulas  are  obuined  by  assuming  that  field  values  are  linearly  and  uniformly 
distributed  between  minval  and  maxval.  When,  as  is  often  the  case,  the  distribution 
is  neither  uniform  nor  linear,  the  formulas  can  be  grossly  inaccurate  with  an  upper 
bound  on  the  estimation  error  being  100%:  the  estimate  can  be  close  to  0.  while  the 
actual  selectivity  is  close  to  1.  and  vice  versa. 

Yu   and   others    [Yu  78]    examined   estimation   of    the   number   of    records   satisfying  a 
query   in   the  context  of    text   retrieval.      In   their   model   all   fields  have   binary   values 
(either  0  or  1).  queries  are  conjunctions  of  fields,  and  records  are  stored  in  clusters  of 
similar  records.      They   have  derived  estimates   for   the  number  of   records  in  a  cluster . 
having  K  or  more  fields  in  common  with  a  query. 

Christodoulakis  suggests  histogramming  for  single-field  conditions  [Christodoulakis  81] . 
He    also    derives    a    very    good    estimation    of    sizes    of    joins    and    block    transfers    in 

[Christodoulakis  83].  In  [Demolombe  80]  only  conjunctions  of  equalities  are 
considered  and  the  main  thrust  is  toward  quantitative  estimation  of  inter-  and  inira- 
rclation  field  dependency. 

A  novel  approach  is  taken  in  [Rowe  S3]  -  statistical  functions  are  computed  not 
botlom-up.  but  top-down.  The  functions  (such  as  MIN,  MAX.  AVG)  are  precomputed 
on  a  collection  of  initial  database  subsets,  called  the  database  abstract.  Then  the 
function  values  on  other  subsets  are  estimated  by  inferring  (using  a  production  system) 
their  upper  and  lower  bounds  from  their  values  on  initial  subsets. 

However,  none  of  these  authors  describe  accurate  estimation  of  comparison  selectivity 
with  a  guaranteed  precision  and  none  analyze  the  desirable  conditions  on  the  estimation 
scheme. 

2.2  DESCRIBING  FIELD  VALUE  DISTRIBUTION 

Field  value  distributions  rarely  have  any  closed  functional  description,  such  as  Zipf 
distribution  [Zipf  49]  which  was  observed  (very  approximately)  in  last  names  in  a 
telephone  book  [Samson  83].  In  most  cases  the  only  description  is  the  distribution 
itself  -  the  list  of  values  of  that  field  across  all  records.  Since  that  takes  too  much 
space,  we  should  look  for  more  compact  schemes.  We  examine  two  alternatives  in  the 
following  subsections.  Our  criterion  for  comparison  is  the  worst-case.  (i.e..  maximum 
possible)  estimation  error. 
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2.2.1  Scheme  1:  a  Histogram 

One  possible  approach  is  to  use  a  histogram.  We  divide  the  range  of  field  values 
into  K  equal-widlh  buckets  and  count  the  number  of  values  falling  into  each  bucket. 

Let  us  consider  a  (hypothetical)  distribution  of  field  AGE  in  a  file  with  information 
on  100  employees  of  a  small  software  company.  AGE  ranges  from  10  (super-bright 
whiz  kid)  to  60  (company  president).  Let  the  histogram  divide  the  range  into  10 
buckets  (see  next  figure). 


#  of 
enployees 


40        40 
XXXX   XXXX 
XXXX    XXXX 
5      XXXX    XXXX      5              5 
1       XXXX    XXXX    XXXX   XXXX      XXXX       1111 
+-- 

10         15        20        25        30        35         40        45        50        55        60 

AGE 

Figure  2-1:      Scheme  1:  Histogram 

A  histogram  is  cheap  to  compute  since  it  requires  only  a  single  pass  through  U:c  file. 
Its  precision,  however,  is  not  always  good.  For  example,  all  it  tells  us  about  SEL(<29) 
(the  percentage  of  the  employees  who  are  younger  than  29)  is 

~  0.46  1  SEL(<29)  <  0.86 

The  true  fraction  of  records  with  AGE  <  29  is  anywhere  from  0.46  to  0.86.  If  we 
take  our  estimate  of  SEL(<29)  as  the  mid-point  in  this  range  (0.66).  then  the  estimate 
can  be  wrong  by  0.20. 

The  maximum  error  in  estimating  SEL(<X)  is  half  the  height  of  the  bucket  in  which 
X  falls.  For  an  unlucky  distribution  of  field  values  (where  the  tallest  bucket  contains 
almost  100%  of  the  records)  a  selectivity  estimate  from  a  histogram  can  be  wrong  by 
almost  0.5." 

Such  a  situation  is  quite  common.  For  example,  for  SALES  in  the  Industrial 
Compustat  file,  the  minimum  value  is  0  and  the  maximum  value  is  108.108  million. 
Yet  95%  of  the  companies  have  SALES  <  2.613  million,  which  is  2.4%  of  the  maximum 
value.  So  if  we  divide  the  range  of  SALES  into  10  or  20  or  even  40  equal  steps, 
almost  all  records  will  fall  into  the  first  bucket  Then  for  any  X  <  2.613  million,  the 
maximum  estimation  error  on  SEL-(<X)  is  close  to  0.5  . 


0,5 
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From  this  discussion  we  see  that  the  way  to  control  the  maximum  estimation  error  is 
to  control  the  height  of  each  buckeu  Thus,  we  should  try  to  create  a  "histogram" 
where  all  buckets  have  equal  height  (instead  of  equal  width).  Such  scheme  is  described 
in  the  next  section. 

2.2-2  Scheme  2:  Distribution  Steps 

Consider  a  field  A  in  some  file.     Distribution  steps  STEP(O).  STEP(l) STEP(S)  are 

values  such  that  the  fraction  of  records  with  A  <  STEP(/)  is  less  than  or  equal  to  l/S. 
For  example,  suppose  we  choose  5=10.  STEP(O)  is  the  minimum  value  of  A.  and 
STEP(IO)  is  the  maximum  value  of  A.  STEP(3)  is  a  value  such  that  about  30%  of  all 
records  ha\'€  A  <  STEP(3). 

Practically,  we  compute  distribution  steps  by  the  following  procedure: 

1.  Collect  the  values  of  A  from  all  the  records  in  the  file  and  sort  them  in 
ascending  order  according  to  the  intrinsic  ordering  of  the  domain.  We  note 
that  this  ordering  must  exist  and  be  unique,  for  otherwise  comparison  A  <  X 
is  not  meaningful. 

2.  Select,  depending  on  the  desired  accuracy  and  available  storage,  the  number 
of  distribution  steps  S.  Select  5*1  positions  (including  the  first  and  the  last) 
in  the  sorted  list  of  field  values,  such  that  there  is  the  same  number  of  field 
values  between  any  two  successive  positions.  These  positions  are  1,  l~N, 
1*2N l+(5-l)«/V.  1*5*/V=#/?.  where  A/  =  (#/7-l)/5. 

3.  Take  values  found  in  these  positions  in  the  sorted  list  of  all  values  and  let 
them  be  the  distribution  steps  STEP(O).  STEP(l) STEP(5). 

From  the  construction  process  we  see  that 

for  every  step  /. 

the  number  of  records  with  A  <  STEP(/) 

is  <  l*N. 

The  data  from  the  previous  example,  as  it  would  be  represented  using  10  distribution 
steps,  is  shown  in  the  following  figure. 

To  estimate  the  same  selectivity  -  SEL(<29)  -  we  first  find  where  "29"  falls  relative 
to  distribution  steps.     Since 

STEP(8)  =  28 
we  know  that  more  than  80  employees  are  28  or  younger,  so  SEL(<29)  >  0.80.     Since 


II    iliai    1-    n.u    [-<->iblc.    i»c    cjn    tijic    icvicr    ijIuc-    bciwi-^-n   ju>i    ilic    Ij>i    !».■   pc-rii.-n-        All    Ww   i.'li..»iiM    NtuiuIj-    .>.'uIo 
Niill    holJ 


Chapter  2 


-  17  - 


Esiimaiing  Selectivity 


AGE 


21 
^?   i 


STEP*   0 
1 


34 


22 

I 

I 


25 


26 


23 


27 


28 


28 


11    21    31    41    51    81    71    81    91 
Position  In  th«  sorted  Hst  of  field  values 


60 


10 
100 


Figure  2-2:      Scheme  2:  Distribution  Steps 

STEP(9)  =  34 

we    know    that    90    or    fewer    employees    are    younger    than    34,    so    SEL(<29)    <    0.90. 
Therefore 

0.80  <  SEL(<29)  <  0.90 


Again  choosing  the  midpoint  of  the  range  (0.85)  as  our  estimate  of  SEL(<29).  the 
maximum  possible  error  is  0.05,  4  times  less  than  in  scheme  1. 

The  error  in  estimating  SEL{<X)  depends  on  where  X  falls  relative  to  the  distribution 
steps.  If  X  falls  between  the  steps,  (as  X=29  above),  then  the  maximum  error  is  half 
of  the  fraction  of  values  falling  between  the  steps,  i.e.,  <  1/2S  . 

If  X  is  equal  to  one  or  more  step  values,  then  (as  we  will  show  later  in  this  chapter) 
the  maximum  error  is  about  1/S  .  In  both  cases,  the  maximum  error  can  be  reduced 
to  an  arbitrarily  small  number  by  choosing  a  sufficiently  large  S. 

The  exact  computation  of  distribution  steps  is  more  expensive  than  that  of  a 
histogram  because  it  requires  sorting  of  all  field  values.  A  cheaper  way  to  compute 
distribution  steps  is  to  use  an  index  on  the  field,  if  it  exists.  Finally,  a  very  cheap 
method  can  be  used  when  we  do  not  need  the  completely  guaranteed  accuracy  of  this 
scheme.  Then  we  can  closely  approximate  the  distribution  steps,  retaining  high  average 
accuracy,  by  using  only  a  small  sample  of  the  records  (see  section  2.7). 

In  the  remainder  of  this  chapter  we  will  assume  that  information  about  field  values  is 
stored  in  the  form  of  distribution  steps. 
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2.3  CONDITIONS  ON  THE  SELECTIVITY  ESTIMATES 

Distribution  steps  provide  upper  and  lower  bounds  on  selectivity  estimates  SEL-(=X). 
SEL-(<X).  SEL-(>X).  SEL-(<X).  SEL-(>X).  There  are  many  ways  to  compute  the 
estimates  wiihm  these  bounds.  One  approach  is  to  compute  the  estimates  for  each  of 
ihe  five  types  of  conditions  independently.  We  choose  not  to  do  thau  Instead  we 
postulate  the  following  desired  relationships  between  the  selectivity  estimates  (these 
conditions  are  satisfied  by  the  actual  selectivity): 

Consistency.     For  every  X, 

{CD     SEL-(<X)  ♦  SEL-(=X)  *  SEL-(>X)  =  1 
(C2)     SEL-(iX)  =  SEL-(<X)  *  SEL-(=X) 
(C3)     SEL.{>X)  =  SEL-(>X)  +  SEL-(=X) 

Additional  desirable  conditions,  also  satisfied  by  the  actual  selectivity,  are: 

Correctness  on  Ends.     Let  minval .  maxval  be  the  minimum 
and  ma.\imum  values  for  the  field,  respectively. 
We  require 

(El)    SEL.(>/77axv/a/)  =  0 
(E2)    SEL-(</77//7V3/)  =  0 

Monotonicity. 

For  every  X.  /.  such  that  X  <  Y 
(Ml)     SEL-(<X)     <     SEL-(</) 

For  every  X 

(M2)     SEL-(=X)  >  0 

We  will  consider  cons/stency.  correctness  on  ends  and  monotonicity  as  the 
axioms  of  selectivity  estimates.  We  believe  that  any  estimation  scheme  should  satisfy 
them.  When  an  estimation  scheme  satisfies  these  conditions  we  can  infer  many  other 
properties  of  the  estimates,  for  example: 

-  From  (C2).  (CD  and  (El)  we  can  infer  SEL-(</77ax\/a/)  =  1 

-  From  (C3),  (CD  and  (E2)  we  infer  SEL~(>minval)  =  1 

-  From  (M2)  and  (C2)  follows  SEL-(<X)  t  SEL-(iX)  . 

-Another  desirable  condition,  satisfied  by  the  actual  selectivity,  is: 
Monotonicity-LE.     For  every  X.  /.  such  that  X  <  Y 

SEL-(<X)  <  SEL-(</) 
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However,  when  combined  with  (C2)  and  (Ml),  it  forces  SEL-(=X)  =  0  everywhere.     We 
have  decided  that  this  effect  is  undesirable  and  so  we  allow  that 

SEL-(<X)  >  SEL-(</) 

but  only  when  /  -  X  is  small. 

There  are  many  ways  to  compute  selectivity  estimates  which  meet  the  conditions  of 
correctness  on  ends  and  monotonicity.  In  the  next  section  we  will  present  one  way 
which  uses  the  distributions  steps  and  minimizes  the  worst-case  error. 

« 

2.4  SELECTIVITY  ESTIMATES  WITH  THE  SM.ALLEST 
WORST-CASE  ERROR 

For  any  X.  we  want  to  estimate 

SEL(=X)  . 

SEL(<X)  .      SEL(iX) 

SEL(>X)  ,      SEL(>X) 

The  distribution  steps  give  us  upper  and  lower  bounds  on  selectiviiies.  For  example, 

if   STEP(/)   <    X   <   STEP(/+1)   then   we   know   that   in    the  sorted   list   of  all   the   field 

values,   the  value  in  position   l+/»yV  is  STEP(/)  (where  N  =   (»R-\)/S).  Hence  there 
are  at  least  1+/»A^  values  less  than  X.  and 

SEL(<X)  >  {l*h»R/S)/»R  >  l/S. 

Likewise,    the    value    in    position    l+(/+l)»A/    is   STEP(/+1)    -    hence    there   are   at    most 
(hl)*N  field  values  less  than  X.  and  SEL(<X)  <  (/*1)/S.     So 

l/S  <  SEL(<X)  1  (M)/S 

Also,  there  are  ^R/S-l  "intermediate"  values  between  these  steps.  They  can  all  be 
equal  to  X  (then  SEL(=X)  =  l/S  -  l/#/?).  or  none  could  be  equal  to  X  (then 
SEL(=X)  =  0).     So 

0  <  SEL(=X)  <  l/S 

Within  these  bounds  we  want  to  find  estimates  SEL-  that  are  consistent,  correct  on 
ends,  and  monotonic. 

Let  MaxErr(SEL-(/'e/-op  X))  be  the  maximum  possible  estimation  error  for 
SELArel-op  X).  Formally,  this  is  the  maximum  difference  between  ShUrel-op  X) 
and  SEL-(/-e/-op  X). 
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Then,  for  any  X.  we  want  to  mininiize 

max(  MaxErT(SEL-(=X)). 
MaxErT(SEL-(<X)). 
MaxErr{SEL-(<X)). 
MaxErT(SEL-(>X)). 
MaxErT(SEL-(>X))     ) 

Since 

SEL-(iX)  =  1  -  SEL-(<X) 

SEL-(>X)  =  1  -  SEL-(fX) 
we      have      MaxErr(SEL-(>X))      =      MaxErr{SEL-(<X)).      and      MaxErr(SEL-(>X))      = 
MaxErr(SEL-(fX)).        Therefore    it    is    enough    to    minimize    MaxErr    for    SEL-(=X). 
SEL-(<X).  SEL-(<X).     Furthermore,  since 

SEL-(iX)  =  SEL-(<X)  *  SEL-(=X) 

we  only  need  to  find  formulas  for  SEL-(<X)  and  SEL-(=X)  .  These  formulas  depend 
on  where  X  falls  relative  to  the  distribution  steps.  In  the  following  subsections  we 
analyze  the  different  cases. 

2,4.1  Case  A:  X  is  between  the  steps 

1  STEP(1+1) 

field  X    0                                                       Here 

value     STEP(I)  !    H 

0  \        0                                  STEPd  )    <   X   <   STEP(I  +  1  ) 

I                 0  \        0 


stepif        I  1  +  1 

We  know  that  //S  <  SEL-(<X)  <  (M)/S 

The  estimate  with  the  lowest  maximum  error  is  (/+0.5)/S.  However,  consider 
SEL-(fX).  For  it  the  best  estimate  is  also  (/*0.5)/S.  Yet.  from  consistency 
requirement, 

SEL-(iX)  =  SEL-(<X)  *  SEL-(=X) 
Thus  we  can  write  that 

SEL-(<X)  =  (/*0.5)/S  -  a*SEL-(=X) 
SEL-(iX)  =  (/*0.5)/S  *  (1  -  a)*SEL-(=X) 

where  0  <   a  f  1.     The  maximum  error  for  SEL-(<X)  is  then 

0.5/S  +  a'SEL-(=X) 

and  for  SEL-(<X)  is 
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0.5/S  +  (l-a)«SEL-(=X) 

To  minimize  the  maximum  of  those  errors,  a  should  minimize  max(a,  1-a).  Hence  a 
=  0.5  and 

SEL.(<X)  =  (/+0.5)/S  -  0.5»SEL-(=X) 
SEL-(iX)  =  (/+0.5)/S  +  0.5»SEL-(=X) 

with  the  maximum  error  0.5/S  +  0.5*SEL-(=X)  in  each  case.  Let  ;3  =  SEL-(=X)  . 
Since  0  ±  0  <  1/S,  the  error  in  estimating  /3  is  max(/J,  1/S-^).  The  maximum  error 
overall  is 

max(0.5/S  +  0/2,  /3.  1/S  -  0) 

This  is  minimized  vvhen  0.5/S  +  0/2  =  1/S  -  0  ,  or  when  0  =  1/3S  .     Thus 

/  +  1/3                                            1/3 
SEL-(<X)  =    SEL-(=X)  = 

s  s 

with  the  maximum  estimation  error  =  2/3S. 

2.4.2  Case  Bl:  X  is  equal  to  one  of  the  steps,  but  not  to  the  first  or  last  step 

j  STEP(I+1) 

field  X-STEPd)        # 

value  STEP(l-l)  *  # 

#  #  (C       • 

j  #  #  # 

+- ------------------------ 

step#     1-1  I  X-^1 

Here  STEP(/-1)  <  X  =  STEP(/)  <  STEP(/+1)  .     Then. 
(/-1)/S  <  SEL-(<X)  <  //S 

//S  <  SEL-(iX)  1  (M)/S 

As  in  case  A.  we  can  say 

SEL-(<X)  =  //S  -  a*SEL-(=X) 
SEL-(iX)  =  //S  *  (l-a)»SEL-(=X) 

with  the  maximum  estimation  error  for  each  formula  being 

1/S  -  min(a.  1-a)  •  SEL-(=X) 

which  is  less  than  1/S  .  To  minimize  the  worst-case  error,  we  should  set  a  =  0.5. 
Since  0  <  SEL-(=X)  <  2/S.  the  minimum-error  estimate  for  SEL-(=X)  =  1/S  .  The 
final  equations  are: 

/  -  0.5  1 

SEL-(<X)  =    SEL-(=X)  =  — 

s  s 

with  the  maximum  estimation  error  =  1/S. 
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2.4.3  Case  B2:  X  is  equal  to  several  steps,  but  not  to  the  first  or  last  step 


field 

value 


X»STEP(I) 

STEP(I-I)  0 

»  0 

0  0 


X 

0 

0 
0 


STEP(1+1) 

0 

0 

0 
0 


1-1 


I 


I+K-l         I+K 


steptf 

Here  X  is  equal  lo  values  of  K  sieps  (K>1)  : 

STEP(/-1)  <  X  =  STEP(/)  =  ...  =  STEP(/+A'-1)  <  STE?{/+K) 
By  reasoning  analogous  to  Case  Hi  we  get 


/  -  0.5 


SEL-(<X)  = 


K 
SEL-(=X)  =  — 
S 


with  the  maxinium  estimation  error  =  1/S 


2.4.4  Case  C:  X  is  equal  to  one  or  more  steps,  including  either  the  first  or  last 
step 

We  note  that  if  X  is  equal  to  both  first  and  last  step.  i.e..  X  is  equal  to  minimum 
and  maximum  field  values,  then  all  values  are  equal  to  X.  in  which  case  estimation  is 
trivial. 


If  X  IS  equal  to  K  steps  (/C>1).  including  the  first  one: 


field 
value 


STEP(K) 
X«STEP(0)  X»STEP(K-1)       0 

0  0  0 

0  ...  *  * 

0  0  0 


Step*  0  ...  K-1  K 

X    »    STEP(O)    «...»>    STEP(K-1)    <    STEP(K) 


then 


SEL-(<X)  =  0 


K  -  0.5 


SEL-(=X)  = 


If  X  is  equal  to  K  steps  (/C>1).  including  the  fast  one: 


field 
value 


X=STEP(S-K+1)  X»STEP(S) 
STEP(S-K)     0  0 

0  0  ...       0 

0  0  0 


Step*     S-K     S-K+1 

STEP(S-K)  <  X  X  STEP! S-K+1 1 


STEP(S) 


then 
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K  -  0.5 
SEL-(<X)  =  1 


2.4.5  Case  D:  X  is  outside  the  range  of  field  values 

If  X  <  STEP(O),  then  SEL-(<X)  =  0 

SEL-(=X)  =  0 

If  X  >  STEP(S).  then         SEL-(<X)  =  1 

SEL-(=X)  =  0 

2.5  SELECTIVITY  ESTIMATES  WITH  LOWER  AVERAGE 
CASE  ERROR 

In  the  previous  section  we  presented  a  set  of  formulas  which  satisfied  all  the  axioms 
and  for  each  X  minimized  the  worst-case  estimation  error  for  SEL-(/"e/ -op  X). 
The  major  disadvantage  of  those  formulas  is  that  for  X  between  the  steps.  SEL- (=X)  = 
1/3S  is  usually  much  higher  than  the  actual  SEL(=X). 

In  this  section  we  present  another  set  of  formulas  (also  satisfying  all  the  axioms)  with 
a  significantly  smaller  average  estimation  error  for  SEL-(=X)  for  X  between  the  steps. 
The  penalty  incurred  is  the  somewhat  greater  worst-case  estimation  error.  In  many 
applications,  however,  minimizing  the  worst-case  error  is  less  important  than  minimizing 
the  average-case  error.  We  find  this  to  be  especially  true  when  using  these  selectivity 
estimates  to  choose  an  evaluation  method  for  a  query. 

The  formulas  in  this  section  are  based  on  distribution  steps  and  on  an  additional 
parameter  called  the  field  density.  These  formulas,  with  some  additions,  are  used  m 
the  query  optimizer  of  FASTSCAN  query  system.  We  do  not  make  any  formal  claim 
about  the  average  estimation  error  of  these  formulas.  In  practice,  however,  they  have 
proved  to  be  very  accurate  -  even  more  accurate  than  the  formulas  from  the  previous 
section. 

Let  /V  be  the  number  of  occurrences  of  value  value  for  a  particular  field 
throughout  a  file. 

Definition  2.2:  The  field  density  is  defined  as 
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.     Z  {N-)  /  #/?-' 

I 

for  all  va/ue   except  ihose  which 
are  equal  lo  at  least  two  step  values. 

Less  formally,  the  field  density  is  the  average  fraction  of  values  that  are  the  same. 
If  there  is  only  one  value  for  the  field  in  all  the  records,  the  field  density  is  1.  If 
each  record  has  a  different  value  in  the  field,  the  density  is  1/#R  If  one  value 
occurs  in  half  the  records,  and  two  other  values  in  one  quarter  each,  the  density  will 
be  between  1/4  and  1/2. 

If  all  values  occur  equally  often,  then  the  density  is  equal  to  SEL{=X)  for  any  X. 
We  can  also  interpret  the  density  as  the  average  value  of  SEL(=X)  over  a  set  of  queries 
A  =  X.  where  the  query  A  =  va/ue  occurs  N  times.  From  these  observations  we 
see  that  density  can  be  used  as  an  approximation  to  SEL(=X).  We  conjecture  that 
density  is  a  better  approximation  than  \/ number _of_different_values  used  by 
System  R.  because  compuution  of  density  takes  into  account  unequal  distribution  of 
field  values- 

We  have  excluded  from  computation  of  density  all  value  which  stretched  across  more 
than  one  step,  i.e..  for  which  N    >  »R/S.     Therefore  density  <  1/S  . 

Let 

5  =  min(0.5/S.  density) 

With  some  exceptions.  5  will  be  the  difference  between  SEL-(<X)  and  SEL-(<X).  5 
is  selected  so  that  the  following  condition  (raonotonicity-LE)  is  partially  satisfied: 

if  X  <  STEP{/)  ±  Y  .  then 
SEL-(fX)  <  SEL-(<VO 

Below   we   present    formulas    for   SEL-(=X).    SEL-(<X)    depending   on    where  X    falls 

relative   to  distribution  steps.      We  do  not  show   the  cases  where  X   is  equal  to  more 

than    one   step,    since   for    those   cases   we   use   the   same    formulas   as   in    the  previous 
section. 

We  have  derived  these  formulas  by  fixing  the  value  of  SEL-(=X)  to  be  5  and  then 
trying  to  minimize  the  errors  for  SEL-(<X).  SEL-(iX)  while  obeying  the  axioms  from 
section  2.3. 
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Case  A:  X  is  between  the  steps 

Here 

STEP(/)  <  X  <  STEP(/-1) 

The  estimates  are: 


/  +  0.5 
SEL-(<X)  = 6/2  SEL-(=X)  =   5 


Case  Bl:  X  is  equal  to  one  of  the  steps,  but  not  to  the  first  or  last  step 

Here 

STEP(/-1)  <  X  =  STEP{/)  <  STEP(/+1) 

The  estimates  are: 

■/ 

SEL-(<X)  = 5/2  SEL-(=X)  =   5 

S 

Case  Cl:  X  is  equal  to  either  first  or  last  step 

• 

If 

X  =  STEP(O)  <  STEP(l) 
then 

SEL-(<X)  =  0  SEL-(=X)  =5/2 

If 

X  =  STEP(5)  >  STEP(S-l) 
then 

SEL-(<X)  =  1-5/2  SEL-(=X)  =5/2 

2.6  A  COMP.ARISON  OF  SELECTIVITY  ESTIMATES  ON  A 
REAL  WORLD  EXAMPLE 

In  this  section  we  compare  the  selectivity  estimates  usmg  the  distribution  steps  of 
field  VOL  -  the  volume  of  stock  trading  on  the  New  York  Slock  Exchange.  The  data 
comes  from  a  single  day  of  trading.  There  are  15049  records  in  the  file,  each  record 
describing  a  single  stock. 

The  field  density  is  0.008  .  There  are  144  distinct  field  values.  6  =  0.008  .  The 
next  figure  shows  a  20-step  field  distribution,  where  nn  %  =  WW  means  that  step  # 
nn/5  has  value  WW.  i.e..  that  nn^c  of  the  records  have  value  <  WW  m  the  field. 
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In  the  following  figure  FASTSCAN  estimate  is  derived  using  formulas  from  section 
2.5.  and  Sm.  w.-c  error  estimate  is  the  smallest  worst-case  error  estimate,  as  derived 
in  section  2.4. 


Sm.    w.-c. 

Actual 

FASTSCAN 

error 

System   R 

condition 

selectivity 

estimate 

est  imate 

estimate 

VOL   <    1500 

0.745 

0.746 

0.725 

0.001 

VOL  -    1500 

0.006 

0.008 

n      AC  A 

0.007 

VOL   >    1500 

0.249 

0.246 

0.225 

0.999 

VOL   <    5000 

0.844 

0.821 

0.8125 

0.005 

VOL  -   5000 

0.003 

0.008 

0.025 

0.007 

VOL   >   5000 

0.153 

0.171 

0.1625 

0.995 

VOL   <    0 

0 

0 

0 

0 

VOL  -   0 

0.576 

0.575 

0.575 

0.007 

VOL   >   0 

0.424 

0.425 

0.425 

1.0 

Figure  2-4:      Comparison  of  selectivity  estimates 

2.7  USING  SAMPLING  TO  APPROXIM.ATE  DISTRIBUTION 
OF  FIELD  VALUES 

Computing  the  distribution  steps  even  for  one  field  of  a  large  file  is  expensive,  since 
it  requires  reading  and  sorting  all  the  values  for  that  field.  Computing  the  distribution 
steps  for  all  fields  in  all  files  (a  typical  example  may  be  50  files  with  100.000  records 
each  with  100  fields  each)  would  take  many  days.  This  might  be  acceptable,  if  the 
database  is  read-only  or  if  it  changes  very  slowly  over  time.  However  there  are  cases 
when  the  database  changes  rapidly,  and  there  is  no  time  to  recompute  all  sutistics  in 
fulL 


A  solution   to   this  problem   is  to   compute  distribution  steps  using  a  sample  of   the 
records,  rather  than  all  of  them.     We  lose  guaranteed  precision  but  we  are  still  able  to 
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maintain   very  good  accuracy.     We  rely  on   the   theory  of  nonparametric  statistics  (see, 
e.g..    [Dixon  79],  chapter  17)  for  analysis  of  estimation  errors. 

KolmogoroVs  statistic  tells  us  that  if  we  take  a  sample'of  size  1064  records  and  /3  is 
the  fraction  of  the  records  in  the  sample  with  field  <  V,  then  with  confidence  99% 
the  fraction  of  the  records  in  the  entire  file  with  fieid  <  V  is  in  the  interval 
[/3-0.05.  ^+0.05]  .  If  we  take  a  sample  of  size  740  then  the  confidence  would 
decrease  to  95%  with  the  same  interval.  The  full  relationship  between  the  sample  size, 
the  confidence  level  and  the  interval  is  given  by  the  uble  of  KolmogoroVs  statistic 
(  [Dixon  79] ,  p.  550).  Interestingly,  the  sample  size  does  not  depend  on  the  number 
of  records  in  the  entire  file. 

We  use  a  sample  of  1064  records.'  We  compute  distribution  steps  using  jusi  the 
records  in  the  sample  and  use  the  same  estimation  formulas  as  we  derived  before. 

Let  STEP  (/)  denote  the  sample  distribution  step  #/.  Consider  X  such  that  STEP  (/) 
<  X  <  STEP  (/*1).     Our  formulas  estimate  SEL(<X)  assuming 

I/S  <  SEL(<X)  <  (/*1)/S 
We  are  no  longer  sure  that  SEL  will  be  within   these  bounds.      We  can.   however,  say 
that  with  confidence  997o 

//S  -  0.05  <  SEL(<X)  <  U*l)/S  *  0.05 

Similar  analysis  applies  to  other  cases.  Thus,  with  99%  confidence,  we  can  say  that 
the  error  in  selectivity  estimates  computed  from  sample  distribution  steps  will  be  no 
more  than  0.05  over  the  error  in  selectivity  estimates  computed  from  the  distribution 
steps  on  the  entire  file. 

In  the  following  figure  we  show  20  distribution  steps  for  field  SALES  from  the 
Industrial  Compustat  file.  The  column  labelled  "all"  refers  to  steps  obtained  from 
reading  all  48320  records  in  the  file,  of  which  38576  were  used  m  computing 
distribution  steps  (the  rest  had  an  undefined  value  of  SALES).  The  column  labelled 
"sample"  refers  to  steps  obtained  from  a  sample  of  1064  records. 

The  following  figure  shows  a  comparison  of  selectivity  estimates.  Column  "Sm.  w.-c. 
est.   on  all"  contains  the  estimates  from  sec.   2.4  (wuh   the  smallest  worst-case  error   ) 
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2.8  SELECTIVITY  OF  LOGICAL  EXPRESSIONS 

In  previous  sections  we  analyzed  estimation  of  selectivity  for  simple  conditions,  called 
comparisons.  Here  we  will  discuss  how  to  estimate  selectivity  of  expressions  formed  by 
connecting  comparisons  wuh  A.ND,  OR  and  NOT  operators.  In  general,  given  some 
logical  expressions  £,  F.  and  their  selectivity  estimates  SEL-(£).  SEL-(F)  we  would  like 
to  find  estimates  SEL-{£  k  F).  SEL-(£  v  F).  SEL-(not  £). 

From  the  definition  of  selectivity,  we  know  that  for  the  actual  selectivity  SEL  : 

0  <  SEL(£  &.  F)  <  min{SEL(£).  SEL(£))  (2.1) 

min(SEL(£).  SEL(F))  i  SEL(£  v  F)  t  SEL(£)  *  SEL(£) 

SEUnoi  £)  =     1  -  SEL(£) 
For  reasons  of  consistency,  we  would  like  selectivity  estimates  SEL-  to  satisfy  the  same 
conditions. 
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The  estimation  of  SEL-(not  £)  is  easy: 
SEL>(not  £)  =     1  -  SEL-(£) 

For  SEL-(£  &  F)  and  SEL-(£  v  F)  there  are  two  cases  -  £  and  F  have  or  do  not 
have  common  subexpressions.     We  examine  these  cases  in  the  next  two  subsections. 

2.8.1  Estimation  of  SEL(E  &  F),  SEL(E  v  F)  when  E  and  F  have  no  common 
subexpressions 

We  recall  that  in  section  1.6  we  assumed  that  no  field  name  is  used  twice  in  "user 
queries.  Therefore,  in  user  queries  "£  &  F"  and  "£  v  F"  E  and  F  cannot  have  any 
common  subexpressions. 

If  we  assume  independence  between  £  and  F  in  the  sense  that 

SEL-(£  &  £)  =  SEL-(£)  •  SEL-(£)  (2.2) 

then  using  SEL-(not  £)  =  1  -  SEL-(£)  and  DeMorgan  Laws,  we  can  derive 

SEL-{£  V  F)  =  1  -  SEL-{not  (£  v  £))  =  !-  SEL-(not  £  L  not  F)  = 

=  1  -  (1  -  SEL-(£))  •  (1  -  SEL-(£))     = 

=  SEL-(£)  +  SEL-(£)  -  SEL-(£)  •  SEL-(£) 

These  formulas  for  SEL-(£  &  £),  SEL-(£  v  F)  are  used  by  System  R   [Seiinger  79]. 

Using  these  formulas,  we  can  estimate  the  selectivity  of  any  logical  expression  made 
of  resolvable  comparisons  connected  by  AND.  OR.  NOT,  where  no  field  name  is  used 
twice. 

If  we  cannot  assume  independence  (eq.  (2.2)).  then  the  best  approach  seems  to  be  to 
use  two-field  distribution  steps,  similar  to  those  presented  in  section  2.2.2.  Such 
distribution  steps  can  be  applied  first  to  estimation  of  selectivity  of  conjunctions  and 
disjunctions  of  comparisons,  and  then  extended  to  apply  to  arbitrary  logical  expressions. 
This  approach  remains  the  subject  of  further  research. 

2.8.2  Estimation  of  SEL(E    &  E  ),  SEL(E    v  E  )  when  E    and  E    have  a  common 
subexpression 

We  will  consider  here  expressions  like 

(£  v  £)  &  (£  V  G) 
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where  E ^  =  (£  v  F)  and  £,  =  (£  v  C).     Such  expressions  are  generated  by  converting 
a  query 

E  \  (F  Sl  C) 

to  the  conjunctive  normal  form  (see  chapter  4).     Let 

e  =  SEL-(£) 
f  =  SEL-(F) 
g  =  SEL-(G) 

If  we  were  to  use  the  formula  (2.2).  then  the  selectivity  estimate  of 
(£■  V  F)  &  (£  V  G) 

would  be 

SEL-(£  V  F)  •  SEL-(£  \  G)  =  (e  *  f  -  e*f)  '  (e  +  g  -  e*g) 
while  the  selectivity  estimate  of  the  logically  equivalent  expression 
E  V  (F  Si  G) 

would  be 

e  +  f*g  -  e*f*g 

These  two  estimates,  in  general,  are  not  equal.  It  is  quite  undesirable  to  use 
estimation  formulas  that  would  produce  different  results  for  equivalent  logical 
expressions. 

We  may  ask,  what  are  the  selectivity  estimate  functions,  satisfying  equations  (2.1).  so 
that  for  any  expressions  £,  F.  G 

SEL-(£  v  (£  &.  G))  =  SEL-((£  v  F)  &  (£  v  G))  (2.3) 

SEL-(£  &.  (F  v  G))  =  SEL-((£  L  F)  v  (£  &  G)) 

MAX  Si  MIN  Selectivity  Estimates  Conjecture:  We  conjecture  that  if 

SEL-(£  &  F),  SEL-(£  v  F)  are  functions  solely  of 
SEL-(£),  SEL-(£) 

and  selectivity  estimates  satisfy  inequalities  (2.1)  and  equations  (2.3)  then  we  must  have 
SEL-(£  <S  £)  =  min(SEL-(£).  SEL-(£)) 
SEL-(£  v  £)  =  max(SEL-(£),  SEL-(F)) 

However,  max  and  min  are  too  crude  as  selectivity  estimates.  Therefore,  to  estimate 
the  selectivity  of  an  expression  like 


Chapter  2  -  31  -  Esiimaling  Seleciivity 

(£  V  F)  &  (£  &  G) 

which  may  arise  from  conversion  to  a  conjunctive  normal  form,  we  first  need  to 
simplify  it  to 

£  V  {F  &.  G) 

and  then  use  the  methods  from  the  previous  section.  Since  we  have  assumed  that  in 
input  query  no  field  name  is  used  twice,  (see  section  1.6),  such  simplification  can  always 
be  done. 
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3.  A  MODEL  OF  QUERY  EVALUATION 

In  this  chapter  we  present  a  model  of  a  single-file  boolean  query  evaluation.  We 
begin  by  giving  an  informal  overview  and  then  treat  the  subject  more  rigorously. 

Our  model  is  based  on  propagating  streams  of  pointers  and  records  through  nodes  of 
a  query  tree.  Streams  are  manipulated  by  several  basic  operations,  such  as  union  or 
intersection.  These  operations,  in  turn,  are  constructed  from  basic  data  operations,  such 
as  open  file,  open  index,  get  record.  A  query  tree  is  the  parse  tree  of  a  query.  A 
query  evaluation  method  (EM)  is  a  sequence  of  basic  stream  operations  applied  at 
query  tree  nodes.  The  output-  of  an  EM  is  a  stream  of  all  records  satisfying  the  query. 
The  EM  cost  is  computed  from  the  costs  of  basic  data  operations  and  cardinalities  of 
its  intermediate  streams. 

For  each  query  tree  our  model  defines  the  space  (called  EM-space)  of  all  possible 
EMs.  In  this  way.  selection  of  the  cheapest  EM  for  a  query  tree  is  reduced  to  search 
of  the  EM-space.  To  get  the  cheapest  EM  for  a  given  query,  we  need  to  consider 
EM-spaces  of  all  the  equivalent  query  trees.     This  problem  is  addressed  in  chapter  4. ' 

For  single-file  boolean  queries  our  model  covers  almost  all  evaluation  methods  used  in 
current  database  systems.  It  can  be  extended  to  multifile  queries  by  adding  operations 
such  as  Sort,  Merge  and  Join  Index  Scan.  In  this  dissertation,  however.- we  will  limit 
ourselves  to  single-file  boolean  queries. 

The  elusive  "optimal"  query  evaluation  has  often  been  described  in  the  literature. 
Usually  lacking  in  those  descriptions  was  a  definition  of  what  is  a  query  evaluation 
method  and  what  is  the  space  of  possible  methods  in  which  to  look  for  the 
optimal  one.  The  following  two  works  are  the  exception.  Yao  in  [Yao  79]  gives  a 
very  good  model  of  two-file  queries,  where  an  EM  is  a  structure  built  from  blocks  like 
"Restriction  Index".  "Join  Index".  "Record  Access",  "Record  Filler",  etc...  The  output 
of  any  such  block  is  a  set  of  pointers  or  records  which  can  be  an  input  to  the  next 
block.  Lorie  and  Nilsson  [Lorie  79]  present  their  Access  Specification  Language,  for  a 
general  (one  or  many-file)  query,  where  an  EM  is  composed  from  scans  (on  index, 
relation,  or  temporary  list)  and  creation  of  temporary  lists,  indices  or  relations.  Their 
model,  however,  assumes  that  any  single-file  query  is  evaluated  wuh  at  most  one  index 
scan.  That  is  usually  not  the  optimal  way  for  evaluating  complex  queries  in  the 
environment  of  off-line  update. 
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Our  approach  is  similar  to  Yao's  and  Lone  &  Nilsson's.  It  is  different  in  using 
streams  rather  than  lists,  in  constructing  EMs  bottom-up  from  basic  data  operations,  in 
defining  the  space  of  possible  EMs,  and  in  givmg  the  algorithm  for  selectmg  the 
optimal  EM. 


3.1  NOTATION  AND  DEFINITIONS 

Every  data  and  stream  operation  is  described  as  a  function,  with  zero  or  more  mputs. 
and  one  or  more  outputs.  For  example,  operation  OP  with  inputs  IN  .  IN,,  IN  and 
outputs  OUT  ,  OUT,  is  written  as 

(OUTj.  OUT,)  =  OP(IN_.  IN,,  IN^) 

Variable  err  is  used  as  an  error  code  and  is  always  0  when  there  is  no  error. 

A  list  of  elements  is  some  static  structure  with  all  elements  stored  in  main  memory 
and  available  for  use.  A  stream  of  elements  is  a  dynamic  process,  which  generates 
one  element  at  a  time.     Only  the  currently  generated  value  is  available  for  use.     For 

example,  the  following  operation  generates  a  stream  of  I  =  1.  2 10. 

do   I    »    1    to    10 

,  .  .    stream  of    I    ... 
•nd  do 

To  use  all  values  of  a  stream,  we  must  first  store  them  in  a  lisL 

R  stands  for  a  stream  of  records.  [/?]  for  a  list  of  records,  and  r  for  a  single 
record.  Likewise,  P  stands  for  a  stream  of  pointers,  [P]  for  a  list  of  pointers,  and  p 
for  a  single  pointer. 

The  cardinality  of  a  list  or  stream  L  is  denoted   \L\. 

Definition  3.1:     A  query  tree  is  the  parse  tree  of  a  query  condition,  with 
condition  constructed  using  the  BNF  grammar  given  in  section  1.6. 


3.2  DATA  MODEL 

Since  we  are  investigating  single-file  queries,  we  will  limit  our  discussion   to  a  single 
data  file.     We  make  the  following  realistic  assumptions: 

1.  The  data  structures  are  a  direct  access  file  and  an  index  (inverted  file).  We 
do  not  consider  other  access  structures,  such  as  threaded  lists  or  cellular  lists. 
because  both  practice  and  research  [Siler  76]  indicate  that  inverted  files  are 
uniformlv  better  in  the  DBMS  environments. 
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2.  Data   and    index    files   are    large    and    reside   on   a   disk.      The   data   file   is 
accessed  directly,  by  record  number. 

3.  The  unit  of  transfer  between  a  disk  and  memory  i<;  a  page. 

Furthermore,  the  data  file  has  #/?  records  which  are  blocked  »b  per  page.  The  total 
number  of  pages  is  #P  =  \»R/ifb\  .  A  tuple  in  a  relation  corresponds  to  a  logical 
data  record.  Each  tuple  has  a  unique  identifier  called  a  TID  or  a  pointer.  In  our 
model  a  pointer  is  simply  a  record  number. 

Indices  are  hierarchical  structures,  like  B-lrees  (for  our  analysis  we  do  not  need  more 
details  about  their  implementation).  Pointers  to  records  are  stored  in  the  index  leaves. 
We  will  consider  only  single-field  indices.  An  important  feature  of  indices  used  in 
database  systems  is  a  fast  "next"  operation  -  that  is.  given  a  position  in  the  index,  we 
should  be  able  to  quickly  find  the  next  (in  the  order  of  key  values)  position. 

3.3  COST  MODEL 

All  operations  in  our  model  are  eventually  reduced  to  basic  data  operations,  as 
defined  in  the  next  section.  Therefore  all  costs  depend  on  the  costs  of  basic  dau 
operations. 

The  cost  of  a  data  operation  is  a  combination  of  1/0.  CPU  and  secondary  storage 
costs.  It  can  be  measured  in  a  variety  of  ways.  If  the  system  is  secondary  storage 
bound  (like  many  of  the  current  personal  computers),  secondary  storage  would  be  the 
primary  cost  measure.  On  the  other  hand,  if  the  system  is  CPU  bound,  then  CPU 
utilization  would  be  the  primary  cost  criterion. 

The  third  measure  of  cost  is  I/O.  I/O  operations  cause  a  lot  of  CPU  overhead  and 
typically  a  database  system  spends  most  of  the  time  retrieving  data.  This  is  especially 
true  in  the  environment  of  a  large  database,  which  is  the  case  for  FASTSCAN. 

Thus  I/O  is  the  primary  cost  component  in  our  environment.  We  will  measure  it.  as 
it  is  usually  done,  in  the  number  of  page  accesses.  While  the  cost  obtained  in  this  way 
is  no  more  than  a  first-order  approximation,  our  experience  indicates  that  it  is  a  good 
approximation,  usually  sufficient  for  correct  selection  of  the  optimal  EM. 

The  cost  of  many  of  basic  data  operations,  such  as  OPEN_FILE  or  GET_RECORD  is 
system-   and    file-dependent.       However,    for    a    specific   system   and    a    file    the   cosi    is 
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approximately  constanL  Rather  than  trying  to  derive  the  cost  of  each  basic  data 
operation  analytically,  we  assume  that  such  cost  can  be  measured  at  system  initialization 
time  and  used  thereafter. 


3.4  BASIC  DATA  OPERATIONS 

Various  I/O  errors,  although  an  important  subject,  are  outside  the  scope  of  this 
dissertation  and  will  not  be  discussed  here.  For  the  restricted  single-file  query  model 
we  need  only  the  following  operations: 

1.  OPEN_FILE  -  opens  the  data  file. 
Cost  CJDF 

2.  CLOSE_FILE  -  closes  the  data  file. 
Cost;  C_CF 

3.  (r,  err)  =  GET_RECORD(p)  -  uses  pointer  p  to  get  a  data  file  page  with 
record  r.  If  the  file  is  open  and  record  r  exists,  returns  r  (wuh  all  its 
fields)  with  err  =  0;  otherwise  r  is  undefined  and  err  >  0. 

Cost  C_GR  -  if  the  page  containing  r  has  lo  be  retrieved  from  disk.  Cost 
=  0  if  the  page  is  in  the  memory. 

Cost  of  getting  K  records  is  usually  less  than  K*C_CR.  and  depends  on  how 
the  records  are  accessed  -  sequentially,  in  random  order,  etc.  The  formulas 
for  different  cases  appear  later  in  this  chapter. 

4.  OPEN_INDEX(A)  -  opens  index  on  field  A 
Cose  CJDKA)  -  depends  on  field  A. 

5.  CLOSE_INDEX(A)  -  closes  index  on  field  A. 
Cost:  C_CI(A)  -  depends  on  field  A. 

6.  (p,  val,  err)  =  GET_FIRST(A.  FCOMPAR,  X) 

where  FCOMPAR  is  "="  ,  ■'>"  .  ">"  .  or  "first".  If  FCOMPAR  =  "first". 
GET_FIRST  searches  the  index  on  field  A  for  the  first  pointer  p\  otherwise 
searches  for  the  first  pointer  where  A  FCOMPAR  X.  If  the  desired  pointer 
is  found,  returns  it,  and  also  sets  va/  equal  to  the  corresf>onding  value  of  A 
and  err=0.  If  the  desired  pointer  is  not  found,  then  p  and  va/  are 
undefined,  and  err>0. 

Cost  C_GF(A)  -  is  the  cost  of  going  from  the  root  of  the  index  to  the 
leaf  node.  It  does  not  depend  on  FCOMPAR  and  X,  if  we  assume  (as  in 
B-trees)  that  the  disunce  from  the  root  is  the  same  for  all  leaf  nodes.  If 
the  index  on  >^  is  a  B-iree  wuh  a  branching  factor  b(A)  .  then 

C  GF(A)  =  C    *  C,  •  log^      #/? 

where  C    and  C^  are  some  system  specific  consianis. 
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7.  (p,  val,  err)  =  GET_NEXT(A)  -  looks  for  a  pointer  p  with  the  /4-value 
val.  If  found,  returns  p  and  val  with  err-^.  If  not  found,  then  p  and 
val  are  undefined,  and  err>^  .  GET_NEXT  can  be  executed  only  after  a 
GET_FIRST  or  a  preceding  GET_NEXT. 

Cost;  C_GN(A) 

3.5  BASIC  STREAM  OPERATIONS 

By  using  streams  instead  of  lists  we  save  time  and  space  wasted  on  writing  and 
reading  intermediate  lists.  Thus  we  will  try  to  use  streams  instead  of  lists  whenever 
possible. 

In  our  model  a  record  includes  its  pointer,  so  a  record  stream  is  rather  a  stream  of 
(pointer,  record)  pairs.  In  this  section  we  describe  the  following  basic  operations  on 
record  and  pointer  streams: 

-  Index  Scan 

-  Data  Access 

-  Data  Filler 

-  Intersection 

-  Union 

-  Sequential  Scan 

For  each  stream  operation  we  show  its  implementation  using  basic  data  operations 
embedded  in  a  FORTR.\N/77-like  programming  language.  For  each  stream  operation 
OP 

S      <—  OPiS    S     ) 

out  in  in 

1  K 

we  derive  the  cumulative  cost  COSKS    )  of  producing  the  stream  S      and  also  the 
Incremental  cost  of  the  operation  OP,  defined  as 

COST(OP(S    ))  =  COSKS    )  -      I  COSKS    ) 

out  out  in 

We  implement  stream  operations  by  breaking  the  code  into  two  parts: 
BEGIN_STREAM  and  NEXT_STREAM.  These  parts  are  always  used  together,  like 
opening  and  closing  parentheses.  The  BEGIN_STR£AM  par:  contains  the  code  ihai 
does   all    the    necessary    initializations    and    gets    the    first   element   of    the   stream.      The 
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NEXT_STREAM  part  contains  the  code  that  gets  the  next  element  of  the  stream  and 

returns   control    to   BEGIN_STREAM.      For   example,   a   stream   of    1=1.    2 100.    is 

generated  by 

1.1  \ 

do  whlla    (I    <    100)  I    BEGIN   STREAM 

straam  of   1    . . .  /  ~ 

1«I*1  \    END_STREAM 

•nd  do  /         ~ 

On  each  call  BEGIN_STREAM  and  NEXT_STREAM  return  a  pointer  (or  a  record) 
which  is  the  current  stream  element  In  addition,  they  return  a  boolean  variable  more. 
which  is  true  as  long  as  there  are  more  elements  in  the  stream. 

3.5.1  Index  Scan 

This  operation,  denoted 

P  <—  \S{comp(A)) 

produces  a  stream  of  all  pointers  to  records  satisfying  the  comparison  compiA)  on  an 
indexed  field  A     Index  Scan  can  be  applied  to  comp(A)  of  "the  following  types: 

A    rel-op  X 

A     BETWEEN  X  AND  / 

A     EITHER  X  .  X X^ 

The  implementation  of  index  scan  is  identical  for  comp(A)  of  the  first  two  types, 
except  in  use  of  a  starting  condition  FCOMPAR  for  GET_FIRST  and  a  do  while 
condition  WCOND. 

WCOND(/A) 

A  =  X 

A  <  X 
A  <  X 

true 
true 
A  <Y 

The  implemenution  of  Index  Scan  : 

open_:noex(/4)  \ 

(p.    tD    •    GET_FIRST(/4,    FCOMPAR)  I     (p,    more)    » 

do   wnne    {err»0   and   WC0ND(/4)     )  |         BEGIN_STREAM(  P  i 

more    »    ( orr=0 )  / 

. . .    stream  P  of    p    ... 


compiA) 

FCOMPAR 

===== 

== 

==== 

========== 

A  =  X 

=  X 

A  <  X 

"first" 

A  <  X 

"first" 

A>  X 

>  X 

A  >  X 

>  X 

A  BET  X 

& 

/ 

>  X 
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(p.    mrr)    •   GET_NEXT(/4)  \ 

•nd  do  I     ( P '    rnora  I    > 

CL0SE_INOEX(/4)  /        NEXT_STREAM(P) 

The  implementation  of  index  scan  on  A  EITHER  X  ,  X^ X    is  similar: 

X_ARRAY( 1)    «    X^ 

X_ARRAY(K)    •    X^ 

0PEN_INDEX(/4) 
do    I    •     1    to   K 

cod«   for    Ind«x   Scan   on  A    *    X_ARRAY(I), 

txjt   without   OPEN_INDEX   and  CLOSE_INDEX 
■nd  do 
CL0SE_INDEXl/4) 

Let   \comp\   be  the  number  of  records  satisfying  the  comparison  comp.     The  cost  of 
computing  P  IS 

C0S1{P)  =  C_0\(A)  *  C_C^(A)  * 

(i  comp  1-1)  •  C_GN(/4)  *  C.CIM) 

Let 

C_INIT(>1)  =  CS>\(A)  "  C_C?(A)  *  C_CI{-4)  -  C_GN(y4)  •     ' 

Then  we  can  rewrite  the  cost  of  ^  as 

COST(P)  =  C_INIT(>1)  *   I  comp  i    •  C_GN(/4) 

Since  IS  has  no  input  streams,  the  cost  of  IS  is  the  same: 

COST(  IS(compM>')  )  =  C_INIT(/A)  +    icomp|    •  C_GN{/4) 


3.5.2  Data  Access 
This  operation,  denoted 
R  <—  DMP) 
takes  a  stream  P  of  pointers  and   produces  a  stream  R  of  corresponding  records.      Its 

implementation: 

:le  \ 

(r,  mope)  =  BEGIN_STREAM( R) 


OPEN_FILE 

( p ,  nwra ) »BEGIN_STREAM( P ) 

do  wh  n • ( more ) 

r  «  GET_RECORD(p) 


.  .  .  stream  R   of 


r  .  .  . 


(p.more)=NEXT  STREAM(P) 
end  do  ~  I   ""•  niore)  =  NEXT_STREAM(  R  i 

CLOSE  FILE  / 


Let  RP  =  number  of  the  retrieved  pages.     Then 
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COST(^)  =  C_OF  +  C_CF  +  COST(P)  +  RP  *  C_GR 
COSKDACP))  =  COST(/?)  -  COST(/=)  = 
C_OF  +  C_CF  +  RP  •  C_GR 

Estimation  of  RP  depends  on  what  other  assumptions  we  make.     If  all  pointers  in  P 
are  consecutive,  then 

RP  =     WPl/^b] 

where  ^b  =  number  of  records  per  page.     If  the  pointers  come  in  random  order  and 
the  data  file  pages  are  not  saved  after  the  first  retrieval,  then 

RP  =   \P\ 

If  data  file  pages  are  saved  in  memory  and  are  not  retrieved  twice,  then 


(  »R  -  #A 


/?p  =  /=  (  1 ) 


(r^ 


\p\) 

One  of  the  better  derivations  of  this  formula  is  in    [Yao  77] .  A  good  approximation  by 
a  closed  noniierative  formula  appears  in   [Whang  83] . 

3.5.3  Dau  Filter 
This  operation,  denoted 

R     <—  DF(/?  .  cond) 

oui  in 

takes    a    record    stream    R      and    checks    its    records    whether    they    satisfv    a    boolean 

in 

expression  cond  on  record  fields.     Those  records  that  do  satisfy  cond  are  outputted  in 
the  record  stream  R    .    The  implementation: 


(r.tnore)  «  BEGIN  STREAM!  R,_)      \ 
—        in 

if  (more)  then 

result  «  cond(r) 

if  r«$u1t«tpu«  then 


( r ,  more )  » 

BEGIN  STREAM(R   ^) 
-        out 


.  .  .  stream  R-^^   o*  f 

end  if 

if  1  (r,  more)  ' 


]-. 


r  «  NEXT_STREAM(R.^)  /    NEXT_STREAM( R^^^ ) 

Since  neither  index,  nor  data  access  are  used,  and  cost  of  evaluating  cond  is  zero  in 
our  model,  we  have 


(p.  more)  * 

BEGIN  STREAM! P) 
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COSKR    )  =  C0ST(/7  ) 

i>ui  in 

COST(DF(/?  .  cond))  =  0 

in 

3.5.4  Intersection 

This  operation,  denoted 

P  <—  INT{P^.P^) 

lakes  streams  P    and  P,  of  pointers  and  produces  the  stream  P  of  all  pointers  that  are 
in  both  P^  and  P,.     It  does  so  by  first  storing  P    in  an  mtermediate  pointer  list   [P] . 

Implementation: 

[P]    *    null 

(p.    inop«P1l    •    BEGIN_STREAM(P1  ) 

do  wni le(moreP1 ) 

add  p   to    [P] 

(p,    morePI)    »    NEXT_STREAM( PI) 
9nd  do 

(p.    inor«)    «   BEGIN_STREAM(P2) 
do  wt-il  !•(  more  ) 

if    p    in    [P]    then 

.  . .  Stream  P   of  pointers  p  .  .  . 

end  if  \ 

(p,  more)  «  NEXT_STREAM( P2 )        [  (p.  more)  - 
end  do  /    NEXT_STREAM(P) 

It  is  also  possible  to  have  intersection  of  record  streams,  but  it  is  always  less  efficient 
than  intersection  of  pointer  streams,  since  a  record  takes  much  more  space  than  its 
pointer. 

Limiting  intersection  operation  to  pointer  streams  allows  an  efficient  implementaiion 
of  this  operation.  A  pointer  in  our  system  is  just  a  record  number.  Thus  intermediate 
pointer  lists  can  be  implemented  as  bitmaps,  or  as  hash  ubies.  Then  the  test  "is  a 
pointer  p  in  the  list  [P]  ?"  takes  a  constant  time  which  is  negligible  compared  to  the 
lime  of  retrieving  the  next  pointer  from  the  index  file. 

We  have  implicitly  assumed  that  we  have  enough  memory  to  hold  the  bitmap,  whose 
size  is  #/?  bits.  This  is  a  reasonable  assumption  since  for  files  of  100.000  records  a 
bitmap  takes  12.500  8-bii  bytes.  On  a  large  machine,  required  to  handle  files  of  this 
size,  there  are  typically  more  than  2.000.000  bytes  available.  Thus  while  memory  used 
for  bitmaps  carries  wuh  it  some  hidden  cost,  ihis  cost  is  not  large  and  can  be  ignored. 

Therefore 
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COST(P)  =  COSJiP^)  *  COSKP,) 
COST(INT(P^.  P  ))  =  0 

3.5.5  Union 

This  operation,  denoted 
P  <—  UNCP^.P,) 

takes  streams  P^   and  P^  of  pxjinters  and  produces  a  stream  P  of  all  pomters  that  are 

in  either  P^  and  P^.     The  implementation,  using  an  intermediate  pointer  list   [P]    : 
[p]  ■  nun 

(p.    tnorePI)    »    BEGIN_STREAM(  Pi  ) 
do  whllednorePl) 

add  p  to   [P] 

(p,  morePI)  «  NEXT_STREAM( P1 ) 
•nd  do  (p,  more)  - 

BEGIN_STREAM(Pj 

(p,  moreP2)  »  BEGIN_STREAM(P2 ) 
do  whi 1e(moreP2) 

add  p  to  [P] 

(p,    inoreP2)    «   NEXT_STREAMi  P2  ; 
end  do 
do  I   =    1    to    I [P] I 

p  '  Pil) 

more   •    (I    <    | [P] |  )  / 

.  . .  stream  P   of  pointers  p  .  .  . 

\  (p.  more)  = 
•nd  do  /    NEXT_STREAM(P) 

Since  computing  union  does  not  involve  any  data  operations, 
COST(P)  =  COST(P^)  +  COST{P,) 

COST(UN(P^.  P,))  =  0 

There  is  a  hidden   cost  of  memory   used   for   pointer   lists,   but.  as  we  argued   in   the 
previous  section,  this  cost  can  be  ignored. 

3.5.6  Difference 

The  Difference  of  pointer  streams  operation  is  denoted 
P  <—  DlFiP^,  P^) 

It  is  implemented  in  a  way  similar  to  intersection  operation. 

Let  res(E)  be  the  stream  which  is  the  result  of  a  logical  expression  E.     We  can  use 
DIF  in  two  cases: 
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1.  To  evaluate  E  AND  NOT  F  nodes,  where  £.  F  are  resolvable  boolean 
expressions,  using  DIF(res(£),  res(F)) 

2.  To  evaluate  compl  AND  compl  nodes,  where  compl  and  neption  of 
compl  are  resolvable  comparisons,  using  DIF(re5(co/77pl).  res(negation  of 
compl)) 

Queries  like  the  first  case  are  infrequent  in  practice.  In  the  second  case,  DIF  is 
cheaper  than  INT  only  if  more  than  half  of  all  the  tuples  satisfy  compl.  Since  users 
infrequently  give  comparisons  satisfied  by  more  than  half  of  the  tuples  in  the  file,  the 
second  case  is  also  rare.  Thus  the  Difference  operation  should  be  used  only  for  only  a 
few  queries  and  the  savings  from  using  it  (as  we  can  infer  from  a  closer  examination) 
would  not  be  drastic. 

The  reason  for  omitting  the  Difference  operation  from  this  study  is  that  by  tolerating 
less-than-optimal  query  evaluation  for  a  small  fraction  of  queries  we  can  build  a 
unifying  theory  for  single-file  query  optimization,  storing  query  frequencies  and  optimal 
index  selection.  Incorporation  of  Difference  into  query  optimization  remains  an  object 
of  further  research. 

3.5.7  Sequential  Scan 

This  operation,  denoted 

R  <—  SS(concf) 

goes  through  every  record  of  the  file,  evaluating  on  each  record  the  boolean  expression 
cond.  The  output  of  SS  is  a  stream  R  of  records  satisfying  cond.  The 
implementation: 


BEGIN_STREAM 


OPEN  FILE  \ 

do  p  «  1  TO  /fR 

(r.  •rr)  =  GET_RECORDip) 

if   cond(r)    then  / 

.  .  .    straatn  /?  of    r    ... 

•ntj    If  \ 

•nd   do  I    NEXT   STREAM 

CLOSE_FILE  / 

Operation  GET_RECORD  will  be  executed  »R  times.  It  will  require,  however,  only 
^P  physical  page  retrievals.  Each  page  will  be  retneved  at  cost  C_GR  when 
GET_RECORD  requests  the  first  record  on  the  page.  Operations  GET_Pl£CORD  to 
retrieve  the  other  records  on  the  same  page  will  have  cost  0.  since  the  page  will  remain 
(will  be  locked)  in  memory.     So. 

COST(SS)  =  COST(/?)  =  C  OF  +  #P  •  C_GR  *  C_CF 
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3.6  CLASSIFICATION  OF  QUERY  TREE  NODES 

To  find  different  EMs  on  a  query  tree,  we  first  need  to  know  how  to  evaluate 
individual  query  tree  nodes.  It  turns  out  that  only  some  query  tree  nodes,  called 
resolvable,  can  be  efficiently  evaluated. 

Definition  3.2;  The  condition  of  a  query  tree  node,  denoted  CON D( node). 
is  the  boolean  expression  corresponding  to  the  query  subtree  whose  root  is 
that  node. 

Definition  3.3:  The  node  result,  denoted  resfnodej.  is  either  a  stream  of 
all  records  satisfying  COND{node)  or  a  stream  of  all  pointers  to  such  records. 

Intuitively,  a  query  tree  node  is  resolvable  if  the  node  result  can  be  computed  by 
some  combination  of  stream  operations  without  accessing  every  record  in  a  file.  A 
node  which  can  be  evaluated  only  by  accessing  every  record  in  a  file  is  called 
unresolvable.     Formally, 

Definition  3.4:  A  query  tree  node  is  resolvable  or  unresolvable  if  it  is 
so  labelled  by  the  following  recursive  algorithm. 

The  algorithm  for  labelling  resolvable  nodes  was  first  given  in  [Astrahan  75] .  Below 
we  present  a  modified  version  of  that  algorithm: 

Algorithm:  Labelling  Query  Tree  Nodes. 
Inpuf  Query  tree.  List  of  indexed  fields 
Output-  Query  Tree  with  all  nodes  labelled 
Method: 

CALL  LABEL(root_of_the_query_tree) 

Procedure  LABEL(NODE) 

LABEL(NODE)  :=  "unresolvable" 

IF  a  NODE  is  a  comparison  of  one  of  the  following  types: 

A  rel-op  X 

A  BETWEEN  X  AND  / 

A  EITHER  Xj,  X, X^ 

AND  field  A  is  indexed 

THEN     LABEL(NODE)  :=  "resolvable" 

ELSEIF  a  NODE  is     "E  &.  F"     THEN 
IF  LABEL(£)="resolvable"  OR 

LABEL(F)="resolvable" 
THEN  LABEL(NODE)  :=  "resolvable" 
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ELSEIF  a  NODE  is     "E  v  F"     THEN 

IF  LABEL{£)="resolvable"  AND 

LABEL(F)="resolvable" 
THEN  LABEUNODE)  :=  "resolvable" 

END  IF 

END  PROCEDURE  LABEL 

An  efficient  impleraenuiion  of  ihis  algonihm  can  be  done  with  a  postorder  [Knuth 
73]    traversal  of  the  parse  tree. 

We  note  that  a  comparison  like 
2  •  /I  +  5  <  45 

on  an  indexed  field  A  is  labeled  "unresolvable".  even  though  it  is  equivalent  to  a 
resolvable  comparison 

/4  <  20 

Such  cases  can  be  handled  by  a  general  inequality  solver.  In  practice,  hovscver,  such 
comparisons  are  very  rare  and  the  benefits  of  an  inequality  solver  do  not  justify  its 
costs. 

3.7  QUERY  EVALUATION  USING  STREAM  OPERATIONS 

In  this  section  we  formally  define  a  query  evaluation  method  (EM),  using  the  concept 
of  stream  operations  on  query  tree  nodes.  Then  we  show  how  to  evaluate  different 
types  of  query  tree  nodes.     Finally,  we  derive  the  formula  for  the  EM  cost. 

Definition  3.5:  An  EM-tree  is  a  query  tree  where  some  nodes  (called 
access  nodes)  are  evaluated  by  stream  operations  so  thac 

1.  the  root  node  is  an  access  node. 

2.  the  result  of  an  operation  on  a  node  is  res(node). 

3.  if  a  node  operation  has  input  streams  they  are  the  results  of  the  nodes 
children. 

Definition  3.6:  An  evaluation  method  (EM)  for  a  given  query  tree  is 
obtained  from  an  EM-tree  by  writing  down  the  node  operations  using 
postorder  query  tree  traversal. 

For  example,  for  the  following  EM-tree. 
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&  "i."    labelled      DF(Da(^),    compl) 

/        \ 
compl  compZ  "comp2"    labelled     P  < —    IS(comp2) 

The  evaluation  method  is 

P  <—  IS(comp2) 

R  <—  DAiP) 

query_resuli  < —  DF(/?.  compl) 

The  code  for  an  evaluation  method  is  generated  by  recursively  generating  the  code  for 
BEGIN_STREAM  and  NEXT_STREAM,  starting  with  the  root  operation. 

In  the  following  subsections  we  show  how  to  evaluate  different  types  of  query  tree 
nodes. 

3.7.1  Root  Node 

An  operation  specific  to  the  root  node  is  the  sequential  scan  of  the  data  file,  as 
defined  in  section  3.5.7.  Although  we  can  apply  the  sequential  scan  to  a  non-root 
node,  that  does,  not  make  much  sense,  since  while  we  are  evaluating  the  condition  ■  of 
that  node  we  can  just  as  well  evaluate  the  entire  query. 

3.7.2  Resolvable  comparison  nodes 

If  a  node  is  a  resolvable  comparison  compfAJ.  then  it  can  be  evaluated  only  by  an 
Index  Scan: 

P  <—  ISicompfAJ) 

3.7.3  Resolvable  AND  node 

Let  node  =  £  &  F  .  We  show  the  possible  ways  to  evaluate  the  node  depending  on 
the  results  of  £  and  F.     The  best  way  is  the  cheapest  of  the  possible  ways. 

Case  1.     £■  is  resolvable  and  F  is  not. 

Case  IP.     P  =  res(£)  is  a  stream  of  pointers.     Then  use 
R  <—  DA(res(£)) 
resinode)  <—  D¥{R.  COND(£)) 
The  cost  of  resinode)  is  equal  to  the  cost  of  obtaining 
P  +  the  cost  of  DAiP).     DF  operation  has  cost  zero.     So 

COST(res(noc/e))  =  COST(£)  -  COST(  DA(res(£))  ) 
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Case  IR.    /?  =  res(£)  is  a  stream  of  records.     Then 
resinode)  <—  DF(res(£).  COND(F)) 

with 

C0ST(res(/70c/e))  =  COST(£") 


Case  2.     F  is  resolvable  and  £  is  noL     The  code,  with 
subcases  2P  and  2R,  is  symmetrical  to  the  code  for  case  1. 

Case  3.     Both  £  and  F  are  resolvable. 

Case  3R?.     res(£)  is  a  record  stream,  while  res(£)  is 
a  pointer  stream.     Then  use  code  for  IR. 

Case  3PR.     res(£)  is  a  pointer  stream,  while  res(£)  is 
a  record  stream.     Then  use  code  for  2R. 

Case  3RR.  Both  res(£)  and  res(£)  are  record  streams. 
Then  either  IR  or  2R  code  can  be  used. 

Case  3PP.     Both  res(£)  and  res(£)  are  pointer  streams. 
Then  use  either  intersection: 

resinode)  <—  lNT(res(£).  res(£)) 
with 

C0ST(res(/7O(ye))  =  COST(£)  -  COSJiF) 

or  code  for  IP.  or  code  for  2P. 


3.7.4  Resolvable  OR  node 

Here   node   =   £  v  £   and    both   £   and   F   must   be   resolvable.      Depending   on    the 
results  of  £  and  F,  there  are  2  cases  : 

1.  both  res(£)  and  res(£)  are  pointer  streams: 

Then  use 

T&s(node)  <~  UN(res(£).  res(£)) 
with 

C0ST(res(/70(ye))  =  COST(£)  +  COST(£) 

2.  at  least  one  of  res(£).  res(£)  is  a  record  stream. 

The  code  for  this  case  is  similar,  except  that  we  need  some  extra  space  (for 
putting  the  record  stream  into  a  list)  and,  possibly,  extra  time  (for  converting 
one  of  the  pointer  streams  into  a  record  stream).  Thus  COST  for  this  case 
is  strictly  more  than  for  the  previous  case.  In  chapter  J  we  will  use  that 
fact  to  show  that  this  case  never  occurs  in  the  optimal  evaluation  method. 
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3.7.5  EM  Cost 

In  our  cost  model,  intersections  and  unions  of  pointer  streams  have  zero  cosi.  Thus 
ihe  cost  of  an  EM  is  the  sum  of  costs  of  Index  Scans  and  Dau  Accesses. 

COST(EM)  =  I  COST(IS)  *  I  COST(DA) 

To  estimate  the  cost  of  an  Index  Scan  on  a  comparison  we  need  to  estimate  the 
number  of  tuples  satisfying  that  comparison.  To  estimate  the  cost  of  Data  Access  on  a 
stream  S.  we  need  to  know  the  size  of  that  stream  \S\.  If  S  =  INT(S  ,S,)  or  S  = 
UN(S|,S,)  .  then   \S\  depends  on  S^  and  S,. 

Computation  of  these  estimates  was  examined  in  the  previous  chapter.  In  this  chapter 
we  have  established  the  space  of  possible  EMs  for  a  query  and  analyzed  the  cost  of  an 
EM.  We  are  ready  now  to  analyze  in  the  next  chapter  the  selection  of  the  optimal 
EM  for  a  query. 
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4.  OPTIMAL  QUERY  EVALUATION 

In  this  chapter  we  describe  how  to  select  an  opiimal  EM  for  a  query,  i.e..  for  a 
retrieval. 

We  assume  an  environment  where  updates  (including  deletions  and  insertions)  are  done 
off-line  or  are  infrequent  compared  to  retrievals.  EMs  exammed  in  this  chapter  may 
use  union  and  intersection  of  pointer  streams.  These  operations  lead  to  a  noticeable 
gap  between  the  time  pointers  are  obtained  from  an  index  and  until  the  time  they  are 
used  to  retrieve  data  records.  Ensuring  that  the  records  are  still  there  may  require  a 
large  amount  of  locking  if  updates  are  frequent.  This  creates  a  substantial  additional 
cost  for  UN  (union)  and  INT  (intersection)  operations.  In  this  dissertation  do  not 
examine  such  costs. 

Notation  and  Conventions:  EM  is  an  abbreviation  for  an  evaluation  method. 
When  we  will  refer  to  a  specific  evaluation  method,  we  will  use  EM ,  with  some 
subscript  /. 

Since  we  deal  with  single-file  queries,  we  can  assume  that  all  the  queries  we  consider 
are  on  the  same  file.  The  Output  fields  of  a  query  do  not  affect  the  query  cost  and 
thus  query  optimization.  Henceforth,  when  we  speak  of  a  query,  we  will  refer  to  the 
boolean  expression  which  is  the  query  Condition.  Letter  Q  will  be  used  for  a  query; 
tree(Q)  will  denote  the  query  tree  of  Q. 

As  defined  in  chapter  3.  an  EM  for  a  query  Q  is  constructed  from  tree(Q).  One  EM 
that  is  always  available  is  a  sequential  scan.  Usually,  we  can  construct  additional  EMs 
using  IS,  INT,  UN,  DA.  DF  stream  operations. 

Definition  4.1:  A  query  Q'  is  logically  equivalent  to  a  query  Q  (denoted 
Q'zQ  )  if  Q'  can  be  obtained  from  Q  by  applying  some  of  the  following 
transformations: 


Commutativitv: 

£  8l'f 

< — > 

F 

&  £ 

E  V  F 

< — > 

F 

V  £ 

Transitivity: 

£'&  (F  &  G) 

<• 

->     (£  &.  F) 

6l  C 

£  V  (F 

V  C) 

<  — 

■>     (£ 

V    F)     V 

C 
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Distributivity. 

E  &  (F  \  C)     <—>    (£  &  F)  V  (£  &  G) 
£  y  {F  &.  G)     <— >    (£  V  F)  &  (£  V  G) 

From  this  definition  we  see  that  logically  equivalent  queries  are  satisfied  by  the  same 
records.  Therefore,  to  compute  the  result  of  Q.  we  can  construct  an  EM  not  only 
from  tree{Q).  but  also  from  tree(Q'),  where  Q  =  Q'. 

The  more  complex  is  Q.  the  bigger  is  the  number  of  logically  equivalent  Q'.  For 
each  such  Q"  there  are  different  possible  EMs  on  tree(Q').  All  of  these  EMs  we  want 
to  consider  when  selecting  an  optimal  EM  for  Q. 

Definition   4.2:   An   opt/ma/   EM    for   Q   is   a   lowest   cost   EM   among   all 
possible  EMs  on  all  tree(Q'),  where  Q'  f  Q. 

We  note  that  there  could  be  several  optimal  EMs  wiih  the  same  cost.  e.g..  they  can 
have  the  same  DA  and  IS  operations,  but  differ  in  application  of  INT  and  UN. 

In  the  next  section  we  show  how  to  select  an  optimal  EM  for  a  conjunctive  query 
-  a  query  which  is  a  conjunction  of  comparisons.  Then  we  analyze  the  possible  ways 
of  evaluating  a  disjunctive  query,  which  is  a  disjunction  of  comparisons.  Combining 
these  results  we  show  how  to  select  an  optimal  EM  for  a  query  in  conjunctive  normal 
form  (CNF),  which  is  a  conjunction  of  disjunctions  of  comparisons.  Then  we  prove 
that  every  query  Q  can  be  converted  to  conjunctive  normal  form  CNF(Q)  so  that  any 
£M    on  Q  will  be  converted  to  £/V/,  on  CNF(Q)  with  C0ST(£/V7,)  <  COST{£M  )  . 

4.1  AN  OPTIMAL  EM  FOR  A  CONJUNCTIVE  QUERY 

Let  us  examine  a  query  that  is  a  conjunction  of  comparisons.  Besides  sequential  scan. 
It  can  be  evaluated  only  by  using  index  scans  (IS)  on  some  of  the  comparisons,  merging 
obtained  pointer  streams  and  applying  the  data  access  DA  and  data  filter  DF  lo  some 
of  the  pointer  streams. 

Theorem  4.3:     An  optimal  EM  on  a  conjunctive  query  Q  is  either  a  SS,  or 
it  can  be  constructed,  in  the  following  order,  from 

1.  one  or  more  ISs.  producing  pointer  streams 

2.  zero  or  more  INTs  on   these  pointer  streams,  producing  a  final  pointer 
stream 
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3.  one  DA  on  the  final  pointer  stream,  producing  a  record  stream 

4.  one  DF  on  the  record  stream,  producing  the  final  record  stream  which 
is  the  query  result 

Proof:  Let  EM  be  an  optimal  evaluation  method  for  Q,  derived  from  tree(Q').  where 
£7  f  Q.  If  EM  is  a  sequential  scan,  the  theorem  is  proven.  Otherwise.  EM.  is 
composed  from  some  combination  of  IS.  INT,  UN,  DA,  or  DF  operations. 

Since  Q  is  a  conjunction  of  comparisons,  the  only  logically  equivalent  transformation 
that  can  be  applied  to  it  is  the  Commutativity.  Hence  Q'  is  also  a  conjunction  of 
comparisons,  but  possibly  in  a  different  order.  Thus  tree(Q')  cannot  have  any  "OR" 
nodes  and  EM    does  not  have  a  UN  operation  which  is  applicable  only  to  "OR"  nodes. 

Clearly,  EM  must  have  one  or  more  IS  operations,  since  they  are  the  only  ones 
which  generate  streams  -  all  other  operations  just  merge  or  modify  streams.  IS 
operations  produce  streams  of  pointers.  The  only  operation  which  can  transform  a 
stream  of  pointers  to  a  stream  of  records  is  DA.  Since  the  query  result  is  a  stream  of 
records,  we  conclude  that  EM    must  have  at  least  one  DA  operation. 

Let's  consider  the  first  DA  operation  in  EM    : 
R    <—  DMP ) 

The  input  to  the  DA  is  a  pointer  stream  P  .  It  can  be  preceded  only  by  IS  and  INT 
operations.  Since  IS  operations  have  no  input  streams,  we  can  move  all  IS  to  the 
beginning  of  EM  without  changing  P  .  Thus,  without  loss  of  generality,  we  can 
assume  that  all  IS  are  evaluated  before  any  of  INT  operations. 

Let  us  consider  an  IS  preceding  the  first  DA  (called  a  "preceding"  IS): 

P^  <—  lS(comp(A)) 

P  is  the  stream  of  all  pointers  to  records  satisfying  comp(A),  and  only  of  such 
pointers.  P  is  the  intersection  of  all  pointer  streams  of  "preceding"  IS.  Let  CONJl 
be  the  conjunction  of  all  comparisons  resolved  by  "preceding"  IS.  Then  P  is  the 
stream  of  all  pointers  to  records  satisfying  CONJl.  and  only  of  such  pointers.  R^  is 
the  stream  of  all  records  satisfying  CONJl.  and  only  of  such  records. 

Let  CONJl  be  the  remainder  of  Q'  after  eliminating  all  comparisons  in  CONJl  (if 
the  remainder  is  empty,  let  CONJl  =  true).     We  thus  have  Q'  =  CONJl  i  CONJl. 
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Let  EM1_BEF0RE  be  the  part  of  EM    before  the  first  DA.     We  can  construct  a  new- 


evaluation  method  EM, 


P    <—  EMI  BEFORE 


/?i  <—  DA(P^)  1   EM, 


/?,  <—  DF(/?^.  C0/^J2)  I 

whose  result,  R^  is  the  stream  of  all  records  satisfying  CONJl  &  C0NJ2,  and  only  of 
such  records.  Since  Q'  =  CONJl  &.  C0NJ7.  /?,  =  res(Q')  and  since  Q'  =  Q.  we  have 
/?,  =  res(Q).     So  EM,  is  an  evaluation  method  for  Q.     We  have 

COSKEM,)  =  C0ST(EM1_BEF0RE)  +  COST(DA(P  )) 

Let  EMI  AFTER  be  the  part  of  EM    after  DA(P  ).     Then 

C0ST(EM_)  =  C0ST(EM1_BEF0RE)  *  COST(DA(P^))  * 
C0ST(EM1_AFTER) 

Therefore  C0ST(EM,)  ±  COST(EM  )  and  EM,  is  an  optimal  evaluation  method  for  Q. 
Since  EM,  is  constructed  from  one  or  more  ISs,  then  zero  or  more  INTs.  then  one  DA 
and  one  DF.  we  have  proven  the  theorem.     Q.E.D. 

4.2  EVALUATION  METHODS  FOR  A  DISJUNCTIVE  QUERY 

A  disjunctive  query  is  a  disjunction  of  comparisons.  It  can  always  be  evaluated  by  a 
sequential  scan.  The  only  other  EMs.  possible  only  if  all  the  comparisons  are 
resolvable,  are  generated  by 

1.  an  IS  on  each  comparison,  producing  a  pointer  stream 

2.  one  or  more  UN  to  merge  these  pointer  streams,  producing  the  final  pointer 
stream. 

3.  a  DA  on   the  final  pointer  stream,   producing  the  final   record  stream   -   the 
query  result. 

There  is  no  point  to  use  a  DA  on  any  but  the  final  pointer  stream. .  since  records  are 
not  needed  for  the  computation  of  a  UN  and,  besides,  a  UN  on  record  streams  is  more 
expensive  than  on  pointer  streams. 

These  EMs  are  different  only  in  the  order  of  IS  and  INT  operations.  Therefore  all 
these  EMs  have  the  same  cost,  since  they  use  the  same  IS  operations  and  have  the  same 
size  of  the  final  pointer  stream  (which  is  the  size  of  the  query  result). 

Thus  any  EM  constructed  by  the  above  procedure  is  an  optimal  one  for  a  disjunctive 
query. 
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4.3  EVALUATION  iMETHODS  FOR  CNF  QUERIES 

Definition    4.4:      A    term   T   is   either    a    comparison    or    a   disjunction   of 
comparisons: 

T  =  comp 
or 

T  =  comp     V    comp^  v    ...  comp 

A   query   is  in  Conjunctive   Normal   Form  (henceforth  CNF)   if   it   is  a 
conjunction  of  terms. 

Without  loss  of  generality,  we  can  assume  that  all  terms  and  hence  all  comparisons  of 

a  query  are  resolvable.     If  ihey  are  not.  the  problem  of  finding  an  optimal  EM  for  the 

query  is  reduced  to  the  problem  of  finding  an  optimal  EM  for  the  subquery  obtained 

by    removing   all    unresolvable    terms.       Once    ihat    is    found,    an    optimal    EM    for    the 

original   query    is   constructed    by    putting   the   result   of    a   subquery    through   a   DF    to 

evaluate  all  unresolvable  terms. 

From  the  previous  section  il  is  clear  that  a  disjunctive  term  in  a  CNF  query  tan  be 
evaluated  either  by  ISs  followed  by  a  UN  of  pointer  streams,  or  as  a  part  of  a 
condition  in  a  DF.  Therefore,  we  can  consider  such  term  to  be  equivalent  to  a  special 
"comparison". 

Then  a .  CNF  query  becomes  a  conjunctive  query  wiih  some  special  "comparisons". 
Applying  theorem  4.3  we  infer  thai  an  optimal  EM  for  a  CNF  query  is  either  a  SS,  or 
it  is  constructed  by  the  procedure  shown  in  the  figure  on  the  following  page. 

Let  COST(n  be  the  cost  of  producing  the  pointer  stream  -  the  result  of  a  term  T. 
where  T  is  either  a  comparison  or  a  disjunction  of  comparisons.  The  result  of  some 
comparison  may  be  used  in  more  than  one  term  -  e.g..  an  EM  for  the  query  (£  v  F) 
&.  (.E  V  C)  can  use  the  result  of  £  twice.  Therefore,  to  get  the  EM  cost  we  compute 
the  set  AC  of  /Access  Comparisons  -  the  set  of  comparisons  used  in  access  terms. 
Then  the  cost  of  an  EM  constructed  by  the  procedure  4-1  is 

I  COSTiP)       *     COST(DA(/=^) 
P  in  AC      ' 

where    COST(DA(/'  ))    is    a    monotonically    increasing    function    of     \F^\-    as    given    in 
section  3.5.2. 

\P  1    can   be   estimated   using   methods   from   chapter   2.      Let   n  be   the   number   of 
terms.     If  we  assume  that  T   are  independent  (in  probabilistic  sense),  then 
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Figure  4-1:       Construction  of  an  Optimal  EM  for  a  CNF  query 

1.  Select    some    of    the    query    terms    (called    access    terms)    by    a    procedure 
described  later. 

2.  For  each  access  term  T   : 

1.  if  the  term  is  a  comparison  connp(A).  generate 

P^<~  IS(compfAJ) 

2.  if  the  term  is  a  disjunction  of  comparisons 

7"  =  comp     V  ...  V  comp 

'i  'k 

a.  then    for   each    comp     not    used    previously    in    an    IS    in    another 

J 
access  term,  generate 

P^    <—  \S(.comp) 

J  J 

b.  Let  rcs(comp  )   be   the   pointer  stream   -   the   result   of   ISicomp ). 

whether  generated   for   this  or   for  previous  access  term.     Generate 
UN  operations  to  merge  all  Tes(comp  )  into  one  pointer  stream  P . 

3.  Use  an  INT  to  intersect  all  P ,  producing  a  final  pointer  stream  P  . 

4.  Use  a  DA  on,  P    to  generate  a  stream  of  records  R  . 

5.  Use  a  DF  on  /?    to  evaluate  the  T    not  among  access  terms.     The  generated 
stream  of  records  is  the  query  resulL 
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"  |/»J   =  #/?  •     n  (lres(n|/#/7)  (4.1) 

i=l.  n 

To  find  the  cheapest  EM  we  can  compute  and  compare  costs  of  all  2"  combinations 
of  tenns  (the  empty  combmation  corresponds  to  a  sequential  scan).  That  is  not  very 
expensive  in  practice,  since  n  is  usually  less  than  4. 

4.3.1   Dominance  between  terms 

We  can  find  the  cheapest  EM  with  much  less  computation  if  we  can  assume  that  eq. 
(4.1)  holds. 

Then,  by  using  the  concept  of  dominance,  we  need  to  compute  the  costs  of  only 
0(n)  combinations,  on  the  average,  instead  of  2".  Dominance  concept  is  also  interesting 
because  it  illuminates  the  structure  of  the  optimal  EM  selection  problem. 

Definition  4.5:     A  term  T  dominates  a  term  T  (written  T  <<  71  if 
1.  r  and  T  have  no  comparisons  in  common 

2.  cosT(r)  <  cosT(r) 

3.   |res(r)|   <   |res(D| 

Lemma  4.6:      Let  T.  T  be  terms  of  a  query  Q.  such   that  T   <<   T.      Let 

EM    be  an  evaluation  method  of  Q  constructed  using  the  procedure  from  the 
previous  section.     If  EM    uses  7".  but  not  T  as  an  access  term,  then  we  can 

construct  EM'  .  a  cheaper  evaluation  method  of  Q. 


Proof:  EM    has  the  following  form: 

...  code _bef ore  ... 

P    <—  res(71 

I 

...  code_after  ... 
R  <~  DA(P^ 

/?P  <~  DF(/?.  T  &.  other _cond) 

where  code_before  and  code_after  could  be  empty.     We  have 

COST(£M^)  =  COSl(code_before)  *  C0ST(71  * 
COSl{code_after)  -  COST(DA(Pp)) 

where  COST(D.A(A'  ))  is  a  monotonically  increasing  function  of    '  P  \    (see  section  3.5.2). 
We  can  construct  EM'    simply  by  interchanging  T  and  T  ; 
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...  code_before 

P^  <—  res(n 

...  code   after  ... 


R  <—  DA(P^ 


R^  <—  DFiR.  T  &  other_cond) 


EM'    will  also  be  an  evaluation  method  for  Q.     Its  cost  is: 

COST(£/Vf' _)  =  COSl{code_before)  *  COSKD  * 
COSl(code_after)  *  COSKDMF^)) 


Since  T  «  T,  COSKD  <  COSKD.  Under  our  assumption.  \F^\  is  proportional  to 
the  product  of  sizes  of  access  terms  (se^  eq.  (4.1)).  Since  |res(r)l  <  |res(n!.  it 
follows  that   \F^\   <    \P^\.     Therefore,  COSKD.MF ^))  <  COSKDA(Pp)).  and 


COSK£M\)  <  COSK£M^) 


Q.E.D. 


The  consequence  of  this  lemma  is  the 

Dominance  Condition:     If  T,  T  are  terms  of  a  CNF  query  Q.  and  T  is  an 
access  term  of  an  optimal  EM  ,  and  T  <<  T.  then  T  is  also  an  access  term 

of  EM  . 
Proof:  If  T  were  not  used,  then  by  Lemma  4.6  we  could  construct  a  cheaper  EM ^. 

So  we  can  topologically  sort  [Knuth  73]  all  T  into  a  directed  graph  using  the 
dominance  relation  and  consider  for  access  terms  only  those  subsets  of  T  which  satisfy 
the  dominance  condition. 

For  example,  if  the  dominance  graph  is: 

T3 
<  <       <  < 

Tl      <<    T2  T5 

<<        << 

TA 

Chen,  instead  of  2"  =  32  combinations,  we  need  to  compute  the  costs  of  only  five  : 
Tl,  T1&T2.  T1&T2&.T3,  T1&T2&T4.  T1&T2&T3&T4&T5.  The  reduction  from  32  to  5  is 
not  negligible,  since  estimation  of  the  EM  cost  requires  access  to  sutistics  files  (see 
chapter  2)  which  may  take  a  noticeable  time. 
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4.3.2  Algorithm  for  selecting  an  optimal  set  of  access  terms 

Summarizing,  we  get  the  following 

Algorithm    for    selecting    an    optimal    set    of    access    terms    for    a    CNF 
query. 

1.  Get   a    "resolvable"    subtree    by    removing   all    unresolvable    T    from    the 
query  tree. 

2.  Estimate  the  cost  and  size  of  pointer  streams  for  each  T   m  the  subtree. 

3.  If   there  are  dominance  relations,   use   them   to   topologically  sort  T    into 
the  dominance  graph. 

4.  Ul     BEST.E.M     =     SS.     BEST_COST     =     COSKSS)     .         Consider     all 
combinations   of   T .      If   a   combination   conuins   two   terms   that   violate   the 

1 

dominance    condition,    go    to    the    next   combination.      Otherwise    compute    the 
cost   C    of    an    evaluation    method    EM'    constructed    from    T    in    the    current 

combination.      If  C   <   BEST_COST.   then   BEST_EM   =  EM'   .   BEST_COST  = 
C. 

When   the  algorithm  stops,   an  optimal   EM    for  a   resolvable  subtree  will   be 
in  BEST_E.V1. 

5.  Add,    as    the    last    operation    in     BEST_E.M,    a    DF     to    evaluate    the 
unresolvable  T   removed  in  the  first  step. 

We  note  that  while  this  algorithm  considers  all  2"  combinations  of  terms,  it  computes 
the  EM  cost  only  for  those  combinations  satisfying  the  dominance  condition.  The 
number  of  those  combinations,  as  our  practice  indicates,  is  0(n}  on  the  average. 


4.4  CONVERTING  A  QUERY  TO  CNF 

In  this  section  we  show  that 

for   every   EM    on   a   query   Q.    there   is  an   EM'     on   a   logically   equivalent 

CNF  query  Q'  such  that  COST{£M'_)  <  COST(£Af,). 
The   consequence   of    this   is   that   an   optimal    EM    for   Q   can    be   constructed    by   first 
convening  Q  to  a  CNF  query  Q'  and   then  applying  the  algorithm  of  section  4.3.2   to 

a. 

Lemma    4.7:         .A.ny    A.ND/OR   query    tree    can    be    converted    to    CNF    by 
applying  bottom-up  the  transformation 

£  V  (F  &  G)  -->  (£  V  F)  k  (£  v  C) 

or  its  mirror  image 
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V  £, 

/   \  /        \ 

E        &  ">  V  V 

/   \  /   \  /  \ 

F        G  E        F      E        G 

Figure  4-2:       CNF  transformation 
(F  &  G)  V  £  —>  {F  V  £)  &  (G  V  £) 

Proof:  In  a  CNF  tree  an  "AND"  node  never  has  a  father  which  is  an  "OR"  node. 
So  if  the  tree  is  not  in  the  CNF,  then  we  must  have  an  "OR"  node  whose  son  is  an 
"AND"  node.  Applying  to  this  node  the  CNF  transformation,  we  will  move  the  "AND" 
node  up.  Since  the  query  tree  has  a  finite  height,  after  a  finite  number  of  such 
transformations  there  would  be  no  "AND"  nodes  whose  fathers  are  "OR"  nodes  -  hence 
the  query  will  be  in  the  CNF.     Q.E.D. 

Lemma    4.8:      After   applying   the   CNF    transformation    to   an   E.M-tree.    it 
remains  an  EM-tree  and  its  cost  does  not  increase. 

Proof:  Let  us  examine  the  part  of  an  EM-lree  (see  figure  4-2)  to  which  the  CNF 
transformation  is  applied.     There  are  3  cases: 

L  the  EM  uses  streams  from  £,  F 

2.  the  EM  uses  streams  from  £,  C 

3.  the  EM  uses  streams  from  £,  F.  G 

In  each  case,  as  we  will  show  later,  there  is  a  corresponding  EM  on  a  transformed  tree. 

Note:  in  the  following  figures,  stream  operations  written  in  the  middle  are  common  to 
EMs  before  and  after  CNF  transformation. 
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Case  1.     Using  EM  on  E,  F. 

We  will  treat  only  the  subcase  where  results  of  both  E  and 
F  are  pointer  streams.     Other  subcases  are  similar. 

I  * 

/  \  /      \ 

E        &  I  V  V 

/    \  /    \         /    \ 

F        0  E         F      E      G 

before  transformation  after  transformation 

EM1  !  EM2 

P1  <--  resiE) 

P2  <--  res(F) 
R3  <--  DA(P2)            I        P3  <--  UN(P1,  P2 ) 
R4  <--  DF(R3,  COnd(G))  R4  <--  DAi P3  i 

Result  <--  UN( R3. DAi PI ) ) [        Result  <--  DF'R4   Condi  E  v  Gl) 

the  costs  are: 

COST(E)    ♦    COST(F)    + 

COSTf    0A(P1)     )    ♦  I  COST(    DA(UN(P1.    P2 ) )     ) 

COST  I    0AIP2)     )  ! 

Since  \Pl\  *  \P2\  >  |UN(Pl.  P2)\.  DA(Pl)  and  DA(P2)  together  will  access  the 
same  or  greater  number  of  records  than  DA(UN(Pl./°2)).     Therefore. 

COST(DA(/=l))  +  C0ST(DAl/'2))   >  COST(DA(  UN(Pl.  P2)  )) 

Moreover,  on  the  left  we  compute  a  UN  of  records  streams  which  is  more  expensive 
than  a  UN  of  pointer  streams  on  the  right.  Since  in  our  cost  model  (see  section  3.5.2) 
the  costs  of  a  union  and  an  intersection  of  pointer  streams  are  zero,  we  conclude  that 
COST(EMl)  >  C0ST(EM2). 


Chapter  4  -  59  -  ,  Optimal  Query  Evaluation 

Case  2.     Using  EM  on  £.  G. 
It  is  symmetrical  to  case  1. 


Case  3.     Using  EM  on  £.  F  and  G. 

Again,  we  will  treat  only  the  subcase  where  the  results  of  E, 
F,  and  G  are  all  pointer  streams. 


V 

& 

/  \ 

/    \ 

E    & 

V        V 

/  \ 

/  \    /  \ 

F    G 

F     E 

G 

before  transformat 

ion 

after  transformation 

EM1 

EM2 

P1 

<-- 

res ( E ) 

P2 

<-- 

res ( F ) 

P3 

<-- 

res(G) 

P4  <--  INT(P2.  P3) 

1 
1 

P4  <--  UN(P1 . 

P2) 

PS  <--  UN{P1,P4) 

1 
1 

P5  <--  UN(P1, 

P3) 

Result  <--  DA(P5) 

1 
1 

P6  <--INT(P4. 

PS) 

I  Result    <--    DA(P6) 

Note  that  here  res(£)  is  used  twice, 
but  it  is  computed  only  once. 

the  costs  are: 

COST(E)    ■••   COST(F)    ••■   COST(G)    + 
COST(    DA(PS)     )  I  COST(    0A(P6)     ) 

Since  \P5\  on  the  left  is  the  size  of  the  query  result,  it  is  equal  to  \Pb\  on  the 
nghL  Therefore  C0ST(DA(/'5))  on  the  left  is  equal  to  C0ST(DA{/^6))  on  the  right. 
making  COST(EMl)  =  C0ST(EM2)  . 
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Combining  these  three  cases,  we  get  the  following  theorem. 

Theorem  4.9:     For  every  EM    on  a  query  Q.   there  an  EM,  on  a  CNFIQ) 
whlth  uses  the  same  indices  with  COSKEM  )  <  COSKEM  ) 

Proof:        By     Lemma    4.7,     we    can     change    Q    to    CNF(Q)     using    only     the    CNF 

transformation.     By  Lemma  4.8.  an  evaluation  method  on  Q  will  become  an  evaluation 

method  on  CNF(Q)  and  its  cost  will  not  increase.     Q.E.D. 

Corrolary:  Therefore,  an  optimal  EM  for  a  query  Q  can  always  be  obuined  from  a 
query  tree  of  a  logically  equivalent  CNfF  query. 

The  cost  of  an  optimal  EM.  as  discussed  in  chapter  3.  is  the  sum  of  costs  of  IS  and 
DA  operations.  In  ihis  chapter  we  showed  that  there  is  only  one  DA  operation  in  an 
optimal  EM.  In  the  next  chapter  we  will  discuss  how  much  information  we  need  from 
each  query  for  the  correct  computation  of  the  cost  of  an  optimal  E.M. 
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5.  STORING  QUERY  FREQUENCIES 

In  this  chapter  we  analyze  how  to  store  information  about  a  set  of  queries  for  the 
purpose  of  selecting  the  optimal  set  of  secondary  indices.  One  obvious  possibility  is  to 
store  the  list  of  all  queries.  Such  list,  however,  not  only  takes  a  lot  of  space  but. 
more  importantly,  is  not  suitable  for  efficient  analysis  by  index-selection  algorithms, 
such  as  presented  in  the  next  chapter. 

Definition  5.1:  The  query  cost  under  an  index  set  4>  is  the  cost  of  the 
optimal  evaluation  method  that  can  be  constructed  by  using  indices  in  <i>. 

Definition  5.2:  A  a  query  frequency  compression  if-scheme)  is  a 
scheme  of  compressing  information  about  a  set  of  queries. 

Definition  5.3:  An  f-scheme  is  admissible  if  it  has  enough  information 
for  correct  computation  of  the  average  query  cost  for  any  set  of  queries 
under  any  index  set 

If  an  optimal  EM   for  a  query  is  always  a  sequential  scan  (whether  because  it  is  the 
only  possible  EM,   or  because  it  is  always  the  cheapest  among  several  EMs).   then   the. 
query  cost  is  not  affected  by  the  choice  of  index  set     Therefore,   we  can  replace  the 
list  of  such  queries  simply  with  the  number  of   them.     This  is  done  in  all  f-schemes 
described  later. 

In  the  remainder  of  this  chapter  we  will  examine  only  the  queries  for  which  the 
optimal  evaluation  method  is  not  always  a  sequential  scan.  We  begin  by  deriving  a 
formula  for  the  query  cost  as  a  function  of  the  index  seL  Then  we  derive  the  formula 
for  the  average  cost  of  a  query  in  a  query  set  We  prove  that  the  average  cost  of 
"similar"  queries  is  the  cost  of  one  such  query  with  the  average  data  and  index  scan 
seleclivities  (defined  below).  Using  this,  we  derive  two  admissible  and  efficient  (under 
■different  assumptions)  f-schemes.  Then  we  describe  a  simple  f-scheme  implemented  in 
FASTSCAN.  called  the  FREQUENCY  TABLE,  which  is  admissible  under  a  "uniformity" 
assumption.  We  argue  that  when  this  assumption  is  not  satisfied,  the  FREQUENCY 
Table  still  allows  the  computation  of  a  good  approximation  of  the  average  query  cost. 
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5.1  THE  QUERY  COST  UNDER  AN  INDEX  SET 

Let  OPT_EM(Q.^)  denote  the  optimal  (under  an  index  set  <I>)  evaluation  method  for 
Q.  constructed  as  described  in  chapter  4.  If  OPT_EM  is  not  a  sequential  scan,  it  uses 
one  or  more  index  scans  and  one  data  access. 

Definition  5.4:  The  Index  Scan  Selectivity  of 
P  <—  ISicompfAJ) 
is  the  selectivity  of  comp(A).  as  defined  in  chapter  2. 

Since    \P\    =  SELiCOmpfA))  •   ifR.  the  cost  of  an  IS  is  (see  section  3.5.1)  : 
QOSl(\S(comp(A))  =  C_INIT(>1)  ♦  Sl.L(comp( A))  •  HR  •  C_GN(/4) 
where  C_INIT(/4)  and  C_GN(/4)  are  constants,  possibly  different  for  each  field  A 

Definition  5.5:  The  Selectivity  of  a  Data  Access  operation 

/?  <—  DMP) 

is    |/?j/#/7.   the  size  of    the   record   stream   produced    by   DA.   divided   by   the 
total  number  of  records. 

We  note  that   \R\   =   \P\   here.     The  cost  of  DA  is  (see  section  3.5.2): 

COST(DA(P))  =  C_OF  *  C_CF  *    \RP\    •  C_GR 

where  C_OF.  C_CF.  and  C_GR  are  constants,  and  RP  is  the  number  of  pages 
retrieved.  In  this  chapter  we  will  make  a  simplifying  assumption  that  \RP\  = 
\P\.  Let  C_INIT_DA  =  C_OF  *  C_CF  .     Then  we  can  simplify  the  previous  formula: 

COST(DA(P))  =  C_INIT_DA  *    \P\    '  C_GR 

Ignoring  for  the  purpose  of  cost  analysis  UN,  INT  and  DF  operations,  we  can  depict 
EM    =  OPT  EM(Q,*)  as: 

P^  <—  lS(comp(A^)) 
P^  <~  lS{comp(A^)) 
R  <—  D.MP^) 

The  cost  of  EM  .  as  we  showed  in  chapter  3,  is  the  cost  of  the  data  access  plus  the 

0 

sum  of  costs  of  index  scans.     Thus 

COSJiEM  )  =  COSKDMP  ))  *  L  COST(IS) 

0  Q 
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In  general,   the  cost  of  an  EM   which   uses  K  index  scans  with  selectivuies  SEL 

SEL^  on  fields  A^ A^,  and  one  DA  with  selectivity  SEL      is 

K  1  K  ■'da 

COST  EM(  {A A  ] ,    {SEL SEL  } .  SEL     )= 

~  1  K  1  K  Da 

Z    [C  INITM)  +  SEL    •  C  GNM)]+  C  INIT  DA  *  SEL       •  C  GR 

~  1  I  —  1  —  —  DA  — 

i=L  K 

Now  we  will  temporarily  make  the 

USE -ALL  ASSUMPTION:  An  optimal  EM  uses  index  scans  on  all 
resolvable  comparisons. 

In  our  experience  this  assumption  holds  for  most  of   the  user  queries.      After  deriving 

an  admissible  f-scheme  under  this  assumption,  we  will  show  how  to  modify  the  scheme 

for  queries  for  which  the  assumption  is  invalid. 

For  further  progress  we  need  to  define  several  key  terms. 

Definition  5.6:  A  subquery  is  obtained  from  a  query  in  CNF  by 
eliminating  one  or  more  conjuncts.  A  subquery  is  resolvable  if  all  the 
comparison  and  logical  nodes  in  the  subquery  tree  are  resolvable. 

Let  Q  be  a  query,  4>  be  an  index  set.  EM     =  0PT_EM(Q,4>).     Then 

-  INDiQ.i)  =    {IND IND  }    is  the  subset  of  indices  of  <t>  used   bv  EM 

IK  ■  Q 

for  evaluating  Q. 

-  IS_SEL(Q.^)  =    {SEL ^^^K^    ^^^  ^^*  selectivities  of   the  corresponding 

index  scans  of  EM  . 

Q 

-  DA_SEL{Q..i)  =  SEL^^  is  the  selectivity  of  the  data  access  in  EM  . 

Then  the  cost  of  a  query  Q  under  an  index  set  <t>  is: 

C0ST_QUERY(Q,4>)  =COST_EM(  IND(Q,<I>).  IS.SELiQ,*).  DA_SEL{Q.<I>)  ) 


5.2  THE  AVERAGE  QUERY  COST 

Let   TOTCNT  be   the   total   count   of   queries   and   L   be   the   list   of   all   queries.      The 
average  query  cost  for  queries  in  L  under  an  index  set  <J>.  is  defined  as 

C0ST_AVG_QUERY(Z.,<1>)  =     I     COST_QUERY(Q,<t>)   /  TOTCNT     = 

Q  \n  L 
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Z  COST_EM  [INDiQ,*).  IS_SEL{Q.<I>).  DA_SEL(Q.*)] 
Q  in  Z. 


TOTCNT 

A  potentially  resolvable  subquery  is  one  iha:  would  be  resolvable  if  all  fields  in  ihe 
file  were  indexed. 

Definition  5.7:  A  Querytype  of  a  query  Q.  denoted  QTfQJ.  is  obtained 
from  the  potentially  resolvable  part  of  the  query  by  replacing  potentially 
resolvable  comparisons  with  their  fields. 

A  querytype  is  like  a  "skeleton"  of  a  query.     For  example,  for  query 
Ql=  "TICKER=-IBM-  &.  SALES/PRETAX>0.1  &  YEAR>79". 

QT(Q1)=    TICKER    &    YEAR.      Query    Q=    "SALES /PRETAX>Q.  10"    has   no   potentially 
resolvable  part,  so  the  querytype  is   []    -  the  empty  set. 

Definition  5.8:     .A  Subtype  is  a  querytype  of  a  subquery. 

For  example,  the  querytype  A  &  8  had  two  subtypes.  A  and  B.  while  A  \  B  hzs  no 
subtypes. 

Definition  5.9:  A  Resolvable  type  of  a  query  Q  with  respect  to  an 
index  set  <J>.  denoted  RT{Q.^).  is  obtained  from  the  resolvable  part  of  the 
query  by  replacing  resolvable  comparisons  with  their  fields. 

For  example,  for  the  same  query  Ql, 

RT(Ql.  *=  (TICKER.  SALES}    )  =  TICKER 

RT(Ql.  *=  {TICKER.  ^TAR}    )  =  TICKER  &.  YEAR 
For  (22=  SALES>45000  v  TICKER=-IBM- 

RT(Q2.  *=  (TICKER,  SALES) )  =  SALES  v  TICKER 

RT(Q2,  *=  (TICKER.  YE.\R})  =    []    .  i.e.,  the  empty  set- 
Let  us  group  together  all  queries  with  the  same  querytype.     Let  0  denote  a  querytype. 
and   {0}    denote  the  group  of  queries  corresponding  to  the  querytype  0.     Let  i>  be  the 
index  set.     Then  RT(0.<l>)  is  the  resolvable  subtype  of  0.  which  uses  indices  IND(©.<i>)  = 

(IND^.   IND, IND  }.      For  every  query  Q  in  group    (0).   IND(Q,<t>)  =   INDi©.*)  . 

Let  us  denote 

-     SEL     =  selectivitv  of  Index  Scan  on  IND    in  OPT  ENKQ,*). 

0  ■  i  - 
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Then  IS  SEL(Q.*)  =   {SEL    ,  SEL    SEL    ) 

—  Q  Q  0 

-  SEL_DA     =  DA_SEL(Q,<I>)  is  the  selectivity  of  the  data  access  in  OPT_EM(Q, 
*)  • 

-  COUNT(0)  =  the  count  of  queries  in  group  {0} 

Let  COST_TYPE(©,*)  be  the  toul  cost  of  all  queries  from  L  with  querytype  0  under 
an  index  set  4>.  We  would  like  to  combine  in  some  way  information  about  different 
queries  with  the  same  querytype.  The  following  theorem,  central  to  this  chapter,  shows 
how  to  do  iL 

Theorem  5.10:  For  a  given  index  set,  the  average  cost  of  all  queries  with 
the  same  querytype  is  the  cost  of  a  query  with  that  querytype  whose  Data 
and  Index  Scan  selectivities  are  the  average  Data  and  Index  Scan  selectivmes 
for  the  querytype. 

COST_TYPE{0)  /  COUNT(©)  = 

COST_EM({IND,....IND  },    (SEUvg SEUvg  } .  SEUvg_DA) 


where 

1.-  K    SEUvg   = 

Q  in   {0} 


For  i=l.-  K    SEUvg   =     I     (SEL   )  /  COUNT(0) 


and 

SEUvg_DA  = 

Q  in   {0} 


SEUvg_DA  =     I     (SEL_DA^)  /  COUNT(0) 


Proof:     Despite  the  formidable-looking  equations,  the  proof  is  very  simple.     We  only 
expand  the  formulas  and  interchange  the  order  of  summation.     Specifically, 

COST_TYPE{0)  =       I       COST_QUERY(Q,<t>)  = 
Q  m    {0} 


I     COST  EM({IND IND  },    {SEL^ SEL^  },  SEL.DA  ) 


Q  m   {0} 


I  [     I  (C_INIT(IND)  +  SEL^    •  C_GN(IND )) 


Q  in    {0}       i=l.K 


*  C_INIT_DA  -  SEL_DA^  •  C_GR  ] 
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interchanging  the  order  of  summation,  we  get 


=     I        [       I     C  INITdND)  *  SEL      •  C  GNdND )  ] 
i=l.K      Q  m   te} 


+         Z     C_INIT_DA  *  SEL_DA^  •  C_GR 
Q  in   [Q] 


We  can  rewrite  the  above  equation  as 

cosT.riTEie)  =  couNT(e)  • 

[     I  (C  INITdND)  *  SELavg    •  C  GNdND )) 
i=l.K 

*  C_INIT_DA  *  SEUvg_DA  •  C_GR     ] 

=  COUNT(e)  « 
COST 

Q.  E.  D. 


COST_EM({IND IND  }.    (SEUvg, SEUvg  } ,  SEUvg_DA) 

1  K  1  K 


Sum  of  COST_TjTE-s  -  costs  of  querytype  groups  -  is  clearly  the  sum  of  costs  of 
all  queries.  Let  T  be  the  set  of  all  querytype  groups.  We  can  rewrite  the  average  cost 
of  a  query  as 

COST_AVG  QUERY!/.,*)  =         I     CGST.riTElG)  /  TOTCNT 

e  in  r 

Finally,  a  break  from  equations  !  To  compute  the  average  query  cost  under  a  g/ven 
index  set  <t>  it  is  sufficient  to  know 

For  each  group  of  querytypes. 
the  number  of  queries 
the  average  selectivity  of  the  data  access 
the  average  selectivity  for  each  index  scan 

Since  sums  are  easier  to  update  than  averages,  we  will  keep  sums  of  selectnities  and 
compute  the  averages  only  when  needed.  Also,  we  do  not  store  the  total  number  of 
queries  -  we  can  always  compute  it  by  summing  numbers  of  queries  over  all  querytypes. 
In  this  way  we  avoid  redundancy  and  its  associated  dangers  of  loss  of  integrity,  update 
bottleneck,  etc. 

To  compute  the  average  query  cost  for  each  index  set  we  need  lo  know  : 
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For  each  querytype. 

The  number  of  queries  wiih  this  querytype 
For  each  resolvable  subtype 

sum  of  data  selectivities 
For  each  resolvable  comparison 

sum  of  index  selectivuies 

In  the  next  section  we  describe  the  proposed  scheme  in  more  detail. 

5.3  AN  ADMISSIBLE  FREQUENCY  COMPRESSION  SCHEME 

We  keep  an  entry  for  each  querytype  of  the  asked  queries,  including  the  empty 
querytype  which  corresponds  to  queries  that  can  be  resolved  only  by  the  sequential  scan. 
In  each  entry  we  keep  : 

1.  Count  of  queries 

2.  Sum  of  selectivities  for  each  query  subtype,  including  the  querytype.  These 
sums  are  used  to  compute  the  data  selectivity  of  this  quer\type  under 
different  index  sets. 

3.  Sum  of  selectivities  for  each  index  scan.  These  sum  are  used  to  compute  the 
index  scan  selectivities  under  different  index  sets. 

We  do  not  keep  the  total  count  of  queries,   but  instead  compute  it  when  necessary  by 
summing  counts  of  all  querytypes. 

For  example,  after  queries 

Ql   -  A>10   i.   B=l    L   C=5  QT(Ql)  =   A  £.    B   i.   C 

Q2  -  A<3   &  C/D>0.9  QT(Q2)  -   A 

Q3   •   A-5   V  C-6  0T(Q3)  -   A  v   C 

QA   -   A-7  QT(QA)  =   A 

we   will   have   the   following  entries  in    the   frequency  scheme   (selectivities  are   assigned 

arbitrarv  values): 
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Querytype  |  Count 


Resolvable  type  |   Sum  of  select ivi ties 


Index  scans:  A 


SEL(A=7)+SEL(A<3)-0.3 
SELU-7)+SEL(A<3)-0.3 


A  &  B  &  C 


A  ir  B  i  C 

A  &  B 

A  i,  C 

B  &■  C 

Index  scans:  A 
B 
C 


SEL(Ql)  -  0.006 
SEL(A>10  i,  B-l)-0.02 
SEL(A>10  i.  C-5)-0.03 
SEL(B-1  &  C-5)=-0.06 
SEL(a>10)=0.01 
SEL(B»l)-0.02 
SEL(C-5)-0.03 


A  V  C 


A  V  C 
Index  scans 


A 

c 


SEL(Q3)-0.U 

SEl(A-5)=0.05 

SEL(C-6)-0.1 


We  can  slightly  improve  this  scheme  by  not  storing  the  index  scan  selectivity  for 
1-field  query  types  (such  as  A),  since  it  is  equal  to  the  data  selectivity.  Thus  we  do 
not  need  to  store     SEL(A=7)  +  SEL(A<3)     twice. 

To  compute  the  average  query  cost  for  some  index  set  <l>.  we  go  over  all  querytypes. 
For  each  querytype  we 

1.  Determine  the  resolvable  (under  <i>)  subtype. 

2.  Determine,  relying  on  the  Use-All  assumption,  what  index  scans  will  be  used 
by  an  optima!  E.M  for  that  querytype. 

3.  Get  the  sums  of  selectivities  for  the  data  access  and  index  scans.  Dividing 
the  sums  by  the  number  of  queries,  we  get  the  average  selectivities. 

4.  Use  these  selectivities  in  COST_EM  function  to  compute  the  average  cost  of 
queries  with  this  querytype.  Multiplying  this  cost  by  the  number  of  queries 
we  get  the  querytype  cost. 

Summing  querytype  costs,  we  get  the  total  cost  of  all  queries.     Finally,  dividing  thai  by 
the  count  of  all  queries,  we  get  the  average  query  cost. 


The  computed   cost  is  correct  since  in  its  derivation  we  followed   the  formulas  above 
that  express  the  cost  of  the  average  query  as  the  sum  of  costs  of  all  querytypes. 
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5.4  COMBINING  STATISTICS  FROM  DIFFERENT 
QUERYTYPES 

The  f-scheme  from  the  previous  section  is  admissible,  but  it  uses  a  lot  of  space.  For 
example,  for  a  conjunction  of  n  terms,  there  are  2"  different  subtypes,  for  all  of  which 
we  keep  the  sums  of  DA  and  IS  selectivities.  In  practice,  queries  rarely  have  more 
than  4  potentially  resolvable  terms,'  and  usually  have  3  or  less,  so  the  above  f-scheme 
would  not  be  very  expensive  to  maintain. 


Some  of  the  entries  in  the  above  f-scheme  may  seem  redundant  For  example,  there 
are  several  entries  for  the  same  query  subtype  if  it  is  a  subtype  of  different 
querytypes.  A  natural  desire  is  to  try  to  combme  the  statistics  for  the  same  subtype. 
This  leads  to  a  simple  f-scheme  using  much  less  space  (discussed  later  in  this  chapter). 
Unfortunately  this  loses  admissibility,  as  we  will  show  in  the  next  example. 

Suppose  we  have  just  2  queries,  Ql  =  "A=\  &.  B=2"  and  Q2  =  "A>3".     The  frequency 

scheme  for  these  queries  is: 

Querytype  |  Count  |   Resolvable  type  |   Sum  of  selectivities 


A   &    B 

1 

A   &   B 

Index   scans:    A 
B 

SEL(Ql) 
SEL(A=1) 
SEL(B=2)   . 

A 

1 

A                                     1       SEL(A>3) 

With  indices  on  A  and  B,  the  average  data  selectivity  of  querytype  A  is  SEL(Ql)  + 
SEL{/4>3).  If  we  combine  the  two  entries  for  A  into  one.  we  will  have  a  simpler 
table: 

Querytype  |  Count  |   Sum  of  selectivities 
A  L   h  I    1    I      SEL(Ql) 


A 
B 


i    1 


SumA  -  SEL(A-l)  +  SEL(A>3) 
SEL(B-2) 


Now   there  is  no  way   to  compute  correctly  the  average  data  selectivity  of  querytype  A. 
since    we    cannot    extract    from    SumA    its    components.      Therefore    we   cannot    always 
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correcUy  compuie  the  average  query  cost  if  we  combine  statistics  from  different 
query  types. 

5.5  AN  ADMISSIBLE  FREQUENCY  COMPRESSION  SCHEME 
WITHOUT  USE-ALL  ASSUMPTION 

In  previous  sections  we  have  operated  under  a  Use- All  assumption  -  that  a  query  is 
evaluated  by  using  all  its  resolvable  comparisons.  From  our  experience  with 
FASTSCAN  query  system,  this  assumption  holds  for  about  90%  of  queries.  For  the 
other  10%  of  queries  an  optimal  EM  does  not  use  all  resolvable  comparisons. 

.\n  argument  similar  to  one  in  the  previous  section  shows  that  without  the  Use-All 
assumption  we  cannot  combine  statistics  for  queries  wiih  the  same  querytype  if  under 
some  index  set  these  queries  have  different  optimal  EMs.  Consider,  e.g..  queries  Ql 
and  02: 

Ql  is  A-1  &  B>50        QT(Ql)  -  A  4,  B 
where  SEL(a-I)  -0.01    SEL(B>50)  -0.5 
for  index  set  {a,B}  the  optimal  EM  uses  IS(a) 

Q2  -  A>20  i  B-2         QT(Q2)  -  A  4.  B 

where  SEL(A>AO)  -0.4   SEL(B=2)  -0.2    SEL(Q2)  =0.08 
for  index  set  {a,B}  the  optimal  Ert  uses  IS(a),  IS(B) 

Their  querytypes  are  the  same.  Under  the  index  set  {A.  B)  the  optimal  EM  for  Ql 
uses  an  index  scan  only  on  A.  while  the  optimal  EM  for  02  uses  an  intersection  of 
index  scans  on  A  and  on  B. 

To  be  able  to  compute  correctly  the  average  query  cost  under  the  index  set  [A,  B) 
we  need  to  store  the  information  that  the  optimal  EM  for  Ql  uses  an  index  scan  only 
on  A. 

In  general,  we  need  to  partition  each  querytype  by  different  optimal  EMs. 

Definition  5.11:  An  EM  equivalence  class  is  a  class  of  queries  wiih  the 
same  querytype  which  have  the  same  optimal  EMs  under  every  index  set. 

Since  EM  equivalence  class  is  a  refinement  of  querytype,  we  can  use  theorem  5.10  to 
show  that  we  can  combine  the  index  and  data  selectivities  for  queries  in  the  same  EM 
equivalence  class. 
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We  will  not  pursue  the  subject  of  EM  equivalence  classes  any  further.  Instead,  we 
close  this  chapter  with  a  description  of  a  very  simple  f-scheme.  used  in  FASTSCAN. 

5.6  FREQUENCY  TABLE  -  A  SPACE-S.AVING  SCHEME  FOR 
FREQUENCY  COMPRESSION 

We  are  not  always  interested  in  the  best  possible  solutions  of  DBMS  problems.  It  is 
sufficient  to  get  a  good  solution  most  of  the  time  and  avoid  bad  solutions.  In  that 
spirit  we  describe  the  following  simple  f-scheme.  which  is  implemented  in  the  current 
version  of  FASTSCAN. 

Although  we  cannot,  in  general,  combine  statistics  from  different  querytypes  (see  sec. 

5.4)  we  can  do  so  if  we  make  a 

Uniformity  assumption.  The  average  selectivity  of  a  resolvable  type  is  the 
same  in  all  querytypes  it  appears. 

In  addition,  we  also  make  the  Use-All  assumption.  With  both  assumptions  we  need 
to  have  only  one  entry  for  each  different  resolvable  type,  saving  a  lot  of  space.  Even 
though  this  scheme  is  not  admissible  in  general,  it  allows  us  to  compute  correctly 
COST_TYPE  for  single-field  querytypes  and  to  compute  a  good  appro.ximation  for 
COST_TYPE  of  other  querytypes. 

The  Frequency  Table  contains  querytypes  and  the  associated  information.  For  each 
querytype  we  keep: 

SumSel  the  sum  of  selectivities  of  all  appearances  of   this  querytype.   whether 

as  a  separate  query  or  as  a  subtype  of  another  query. 

CumCnt  Cumulative   Count,    the    count    of    all   appearances   of    this   querytype. 

whether  as  a  separate  query  or  as  a  subtype  of  another  query. 

IndCnt  Independent  Count,  the  count  of  app>earances  of  this  querytype  only  as 

an  independent  (or  separate)  query. 

for  querytypes  of  one  field  we  also  keep: 

OrCnt  OR  Count  -  the  count  of  appearances  of  this  field  in  a  disjunct,  (i.e.. 

inside  OR). 

OrSumSel  the  Sum  of  selectivities  of  disjuncts  inside  which  this  field  appeared. 

Whenever  a  query  Q  is  asked,  we 
1.  Convert  a  potentially  resolvable  subquery  of  Q  into  a  CNF  graph. 
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2.  Compute  the  query  type  qt  of  the  CNF  graph  and  its  selectivity  sel. 

3.  Insert  or  update  the  qt  entry  in  the  frequency  table,  by 

SUMSEL((7n  =  SUMSEKv^)  *  sel 
INDCNT(c7r)  =  INDCNT(<7r)  -   1 
CUMCNT(£7r)  =  CUMCNK^n  "  1 

4.  Compute  all  subtypes  of  qt  and  selectivuies  of  their  corresponding  subqueries. 
Insert  or  update  the  entry  for  each  subtype  St.  by 

SUMSEL{5r)  =  SUMSEL(5r)  ♦  SEUst) 
CUMCNKsr)  =  CUMCNT(sr)  ♦  1 

5.  If  qt  has  disjuncts.  then  each  disjunct  dis  is  (A    v  -4,  v  ...  A^)        For  each 
A    insert  or  update  its  entry  in  the  frequency  table,  by 

0RSUMSEL(-4  )  =  0RSUMSEL(/4  )  *  SEUc^/s) 
ORCNT(/A)  =  0RCNT{<4)  -  1 

For  example,   after   query   "A   =    10   &   B   >   20   &   C  =   D"   we   update   the   following 
entries: '° 

entry  for  (A  &  B) 

SUMSEL  =  SUMSEL  ♦  SEUA=10  &  B>20)  =  SUMSEL  *  0.02 

INEXTNT  =  INDCNT  *  1 

CUMCNT  =  CUMCNT  +  1 

entry  for  (A) 

SUMSEL  =  SUMSEL  *  SEL(A=10)  =  SUMSEL  ♦  0.1 

CUMCNT  =  CUMCNT  *  1 

entry  for  (B)- 

SUMSEL  =  SUMSEL  -  SEL{B>20)  =  SUMSEL  ♦  0.2 

CUMCNT  =  CUMCNT  -  1 

After  query  "A=20  v  B=20"  we  update 

entrv  for  (A  v  B) 

SUMSEL  =  SUMSEL  *  SEL(A=20  v  B=20)  =  SUMSEL  *  0.3 

INDCNT  =  INDCNT*! 

entrv  for  (A) 

ORSUMSEL  =  ORSUMSEL  *  SEL(A=20  v  B=20)  =  ORSUMSEL  *  0.3 

ORCNT  =  ORCNT+1 

entry  for  (B) 

ORSUMSEL  =  ORSUMSEL  +  SEL(A=2G  v  B=20)  =  ORSUMSEL  +  0.3 

ORCNT  =  ORCNT- 1 


',0 
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After  query  "A=15  &  B/C>0.1"  we  update 

entry  for  (A) 

SUMSEL  =  SUMSEL  +  SEL(A=15)  =  SUMSEL  +  0.15 

INDCNT  =  INDCNT  +  1 

CUMCNT  =  CUMCNT  +  1 


If  these  three  queries  were  the  only  ones  in  the  frequency  table,  it  would  contain: 


r  requenc; 

y  cao 

ie- 

Querytype 

INDCNT 

CUMCNT 

SUMSEL 

ORCNT 

ORSUMSEL 

.>=i.>o>: 

KM  =  S 

=  .:> 

=.======= 

»«s  =  =  » 

,:mm  =  ^m  =  =  =  = 

A 

1 

2 

0.25 

1 

0.3 

B 

0 

1 

0.2 

1 

0.3 

A  &  6 

1 

1 

0.02 

0 

0 

A  or  B 

1 

1 

0.3 

0 

0 

As  in  the  scheme  described  in  the  previous  section,  the  total  count  of  queries 
(TOTCNT)  is  not  stored  but  computed  when  needed  by  summing  INDCNT  for  all 
entries. 
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6.  OPTLMAL  AND  NEAR  OPTIMAL  SECONDARY  INDEX 
SELECTION 

In  chapter  4  we  have  studied   how   to  optimize  a  single  query.      In   this  chapter  we 

attack  the  optimization  of  a  set  of  queries.     That  is,  given  a  set  of  queries  on  a  single 

file,  we  want  to  find  the  optimal  set  of  secondary  indices  which  minimizes  the  average 
query  cost,  as  derived  in  chapter  5. 

Formulation  of  the  optimal  index  selection  problem  (OIS)  depends  on  whether  the 
index  update  is  done  on-line  or  off-line.  The  goal,  in  both  cases,  is  to  optimize  the 
performance  measured  as  a  combination  of  costs  of  queries,  updates  (including  deletions 
and  insertions)  and  index  storage. 

In  the  on-line  case,  adding  an  index  on  a  field  may  reduce  the  cost  of  a  query  that 
has  a  resolvable  predicate  on  that  field.  On  the  other  hand,  it  will  increase  the  cost  of 
an  update  on  that  field.  The  object  of  optimization  is  to  find  an  index  set  with  the 
smallest  sum  of  query  and  update  costs.  Because  of  the  high  cost  of  the  on-line  index 
update,  the  optimal  number  of  indices  is  usially  small  and  thus  we  usually  do  not  put 
an  additional  limitation  on  the  number  of  indices  or  the  index  storage  cost. 

In  the  off-line  case,  the  updates  are  done  when  the  system  is  not  available  for 
retrievals,  typically  overnight.  The  update  cost  can  be  ignored  if  all  updates  can  be 
completed  during  that  time.  The  number  of  indices  in  this  case  is  usually  constrained 
by  the  time  available  for  the  off-line  update  and  by  the  index  storage  cost  If  there 
is  no  such  constraint,  the  problem  is  trivially  solved  by  indexing  every  field. 

The  optimization  problem  is  usually  formulated  by  assigning  some  weights  to  different 
indices  and  minimizing  the  query  cost  while  keeping  the  sum  of  index  costs  below  a 
certain  number.  \  simpler  formulation,  which  we  use  in  this  chapter,  is  to  assign  equal 
weights  to  all  indices.  The  goal  of  optimization  then  is  to  minimize  the  average  query 
cost  using  no  more  than  some  number  of  indices. 

In  both  on-line  and  off-line  cases,  the  OIS  problem  is  NP-hard  [Comer  78,  Piatetsky 
83.  Ip  83]. 

OIS,  in  its  many  variations,  was  analyzed  by  many  authors.  Some  of  them  have  used 
formal  analysis  in  an  attempt  to  derive  the  exact  solution  -  [King  74.  Stonebraker 
74.  Schkolnik  75.  Yue  75.  Anderson  77.  Batory  82b.  Whang  81] .  while  others  have  used 
a  heuristic  approach  -    [Lum  71.  Hammer  76.  Held  75,  Ro'ussopoulos  82.  Ip  83] . 
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In  particular,  Schkolnik  [Schkolnik  75.  Schkolnik  79]  considered  single-file  queries 
which  are  conjunctions  of  field  =  constant  in  the  context  of  on-line  update,  with  no 
constraint  on  the  secondary  storage.  He  assumed  that  all  indices  are  used  in  query 
evaluation  (see  Use-All  assumption,  chapter  5).  independence  of  different  fields  and 
uniform  distribution  of  field  values.  Schkolnik  introduced  a  regularity  condition  on 
functions  of  index  sets  and  proved  that  the  cost  of  the  average  query  satisfies  this 
condition.  The  consequence  is  that  the  depth-first  search  in  the  space  of  all  index  sets 
will  find  the  optimal  solution.  The  running  time  of  that  search  is  0{2^  log  N),  where 
/V  is  the  number  of  fields  to  consider  for  indexing. 

Hammer  and  Chan  [Hammer  76]  also  examined  single-file  queries  in  the  on-line 
update  environment.  The  allowed  queries  are  conjunctions  and  disjunctions  of  equalities. 
The  queries  are  evaluated  by  using  indices  on  all  resolvable  predicates.  Hammer  and 
Chan  presented  a  heuristic  algorithm  with  a  polynomial  running  time  for  the  index 
selection  problem.  Their  algorithm  usually  finds  good  solutions  but  is  not  guaranteed  to 
find  the  optimal  solution  or  to  be  within  some  percentage  of  the  optimal  solution. 

Ip  and  others  [Ip  83]  reduce  on-line  OIS  to  knapsack  problem  and  use  knapsack 
approximation  algorithms   [Horowitz  78]    with  good  results. 

In  this  chapter  we  consider  the  case  of  off-line  update.  We  assume  that  the 
information  about  the  query  set  is  stored  in  the  Frequency  Table  (see  section  5.6). 
Formally,  the  problem  is: 

Off-Line  Optimal  Index  Selection  Problem 

Given  a  file 

Given  a  query  frequency  table  FT  with  N  fields 
Find  an  index  set  with  no  more  than  K  indices,  which 
minimizes  the  average  query  cost. 

This    problem    is    NP-hard   [Piatetsky    83].        In    the    following    sections    we    discuss 

algorithms  for  solving  it. 

6.1  ALGORITHMS  FOR  OPTIM.AL  INDEX  SELECTION 

We  have  implemented  three  algorithms  to  solve  this  problem.  The  first  is  the  total 
search  of  all  combinations  of  K  indices  or  less.  It  always  finds  the  optimal  solution 
but  its  running  time  is  exponential  and  it  is  impossible  to  use  ii  for  problems  with  over 
10  fields.  Still,  it  is  useful  for  debugging  and  as  a  benchmark  for  other  algorithms. 
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Figure  6-1:       Comparison  of  B&.B,  Greedy  and  Total  search  algorithms 

We  have  used  a  frequency  table  describing  559  queries  on  file  Industrial  Compustat. 
The  table  has  33  fields.  95  query  types  and  68  independent  query  types.  It  was 
obtained  by  recording  actual  user  queries  and  transforming  them  to  increase  the  number 
of  different  querytypes.     All  three  algorithms  found  the  optimal  index  set  m  all  cases. 

f^SetS   is  the  number  of  index  sets  whose  cost  was  computed  by 
the  algorithm;  Cpu    is  the  CPU  seconds  spent  by  the  algorithm. 
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The  second  algorithm  (called  B&B)  is  a  Branch  and  Bound  search  wuh  an  admissible 
cutoff.  It  is  guaranteed  to  find  an  optimal  solution  and  is  much  faster  that  the  tola! 
search  algorithm.  It  finds  the  optimal  solution  for  a  sample  problem  (see  figure  6-1) 
of  33  fields  in  about  a  minute  of  CPU  time. 

The  third  algorithm  is  a  hill-climbing,  or  "Greedy",  nonbacktracking  search  with  the 
same  admissible  cutoff.  While  it  is  not  guaranteed  to  find  the  optimal  solution,  it 
actually  found  it  for  all  the  Frequency  Tables  we  got  from  users  of  FASTSCAN.  The 
Greedy  algorithm  is  very  fast  -  its  worst  and  average  case  running  time  is  0(N*Kj.  It 
solved  the  same  problem  of  33  fields  in  2.5  seconds  of  CPU  time  and  found  the 
optimal  solution. 

Both  the  B&B  and  Greedy  algorithms  search  a  "tree"  of  all  possible  index 
combinations.     The  combinations  are  considered  in  a  fixed  order  : 


1 

1,  2 

1.  2.  .. 
1.  2.  .. 

..  K-l,  K 
...  /C-1.  /C+1 

1.  2 /C-1,  N 

This  order  enables  us  to  do  an  incremental  computation  of  index  set  costs  because  we 
only  need  to  compute  the  cost  from  adding  one  index  to  the  set.  This  makes  our 
algorithms  an  order  of  magnitude  faster. 

B&B  algorithm,  while  traversing  this  tree,  saves  in  a  stack  all  costs  along  the  currently 

active  path.      Thus  when   B&B  backtracks  from   index   combination    (1.2 L  M]    to 

{1,2 L]    it  never   needs   to  recompute   the  cost  of   the  latter  -   it  simply   gets  that 

cost  from  the  stack. 

At  the  heart  of  efficiency  of  both  algorithms  are  the  good  initial  ordering  of  fields 
and  an  admissible  cutoff  criterion.     We  explain  them  m  the  next  two  sections. 

6.2  INITIAL  ORDERING 

The  max/mum  improvement  from  an  index  on  a  field  A.  denoted  MaxImpriA).  is 
computed  as  the  improvement  in  the  average  query  cost  assuming  ihai  adding  an  index 
on    A    would    cause    the    maximum    possible    improvement    in    each    querytype    having    A. 
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This   value   is  an    upper    bound   for   the   improvemenl   in   the  average   query   cost   from 
adding  an  index  on  A  to  any  index  set 

For  querytypes  with  /4  as  a  top-level  conjunct,  (e.g..  A  &l  B)  the  maximum 
improvement  comes  when  the  optimal  EM  for  that  querytype  changes  from  a  sequential 
scan  to  an  index  scan  on  A  .  For  querytypes  with  A  -is  ^  part  of  the  top-level 
disjunct  (e.g.,  /4  \/  fi)  the  maximum  improvement  comes  when  the  optimal  EM  changes 
from  a  sequential  scan  to  an  EM  using  the  disjunct  of  which  /*  is  a  parL 

The  number  of  queries  of  the  first  type  is  CumCnt(/4)  with  sum  of  seleciiviiies 
SumSeK/A).  The  number  of  queries  of  the  second  type  is  OrCni(/4)  with  sum  of 
selectivities  OrSumSel(/4).     Let 

-  AvSelKA)  =  S\xmSt\{A)  /  CumCnt(-4) 

-  OrAvSelKA)  -  OrSumSel(-4)  /  OrCnt(<4) 

-  CostSS  be  the  cost  of  a  sequential  scan 

-  CostEMHA.  SEL  )  be  the  cost  of  an  evaluation  method  using  just  one  index 

A 

scan  on  A  with  selectivity  SEL  . 

CosiEMKA    SEL  )    =   COST  EM(t>A}.    (SEL  }.    SEL  ).    where    COST_EM    is 

A  —  A  A 

defined  in  chapter  5. 
Then  the  maximum  improvement  from  queries  of  the  first  type  (i.e..  where  A  \s  ^  lop- 
level  conjunct)  is: 

CumCnt  •  (CostSS  -  CostEMl(/4.  .ALvSeU/A))  ) 

The  maximum  improvement  from  queries  of  the  second  type  (i.e..  where  /A  is  a  top- 
level  disjunct)  is 

OrCnt  •  (CostSS  -  CostEMl(>A.  OrAvSeK/A))  ) 

Summing  these  values  and  dividing  by  the  total  number  of  queries  we  can  compute 
MaxImpr(-4)  as 

MaxImpr(>A)=  (1/TOTCNT)  •  (6.1) 

[CumCnt  •  (CostSS  -  CostEMl(>A,  AvSeK/4))  * 
OrCnt  •  (CostSS  -  CostEMK/A,  OrAvSel(>l))  )] 

We  get  the  initial  ordering  for  both  the  Greedy  and  B&B  algorithms  by  sorting  the 
fields  used  in  queries  in  the  descending  order  of  their  Maximpr.  Thus,  the  field  that 
gives  the  most  improvement  is  the  first,  etc.  . 
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Hereafter  all  field  numbers  refer  to  this  initial  ordering. 

6.3  THE  CUTOFF  CRITERION 

Let  i,  ^  denote  index  sets.  |  <f>  |  is  the  size  of  the  index  set  <l>.  When  we  discuss  a 
cost  or  an  improvement  in  <I>  we  refer  to  the  average  query  cost  under  <I>  or  the 
improvement  in  that  cost  from  adding  some  additional  indices  to  4>. 

Let  LASK*)  be  the  largest  (in  initial  ordering)  field  number  in  <I>.  The  extension 
of  <l>  is  any  index  set  *  such  that 

L  all  fields  in  4>  are  in  4» 

2.  all  fields  in  *  -  #  have  field  numbers  >  LAST(<t>) 

Both  B&B  and  Greedy  search  algorithms  work  by  selecting  an  index  set  and  trying  to 
extend  it  one  index  at  a  time.  At  any  point,  given  an  index  set  <I>  we  want  to  know 
whether  any  extension  of  it  can  be  the  optimal  index  set  If  we  can  compute  a  lower 
bound  for  the  cost  of  any  extension  of  <l>  and  jt  is  higher  than  the  cost  of  the  best 
index  set  we  have  had  so  far.  then  we  can  stop  extending  $  -  "cut"  the  branches  from 
*.  This  is  the  essence  of  pur  cutoff  tesL  We  note  that  it  is  not  a  heuristic  -  if  the 
cutoff  eliminates  a  branch  of  the  search  tree,,  then  this  branch  is  guaranteed  not  to 
contain  the  optimal  solution. 

To  be  useful,  the  cutoff  test  must  be  computationally  efficient  Thus  we  need  a  fast 
method  to  compute  the  lower  bound  on  the  cost  of  an  extension.  Below  we  describe 
two  methods  which  take  linear  time. 

6.3.1  Lower  bound  on  extension  cost  using  Maximpr 

One  way  to  determine  the  lower  bound  on  the  cost  of  any  extension  of  <^  is  to  use 
Maximpr.       The   maximum    number    of    indices   we   can   add    to    4>   is   A"- 1  <I>  |  The 

improvement  from  adding  these  indices  is  no  more  than  the  sum  of  their  individual 
improvements.  Since  the  fields  are  sorted  in  the  decreasing  order  of  Maximpr.  the 
biggest  improvement  for  any  extension  of  *  would  be  if  we  get  A"-  j  *  |  fields 
immediately  after  LAST(«t').     Let  Z.  =  LAST(4>).     Then 

COST(any  extension  of  4')  >  (6.2) 

COST(<t>)  -  (.Maxlmpr(Z.-l)  ♦  ...  *  Maxlmprf/.-ZC- ;  "t; )  ) 
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6.3.2  Lower  bound  on  extension  cost  using  ImprTail 

Let  Tail(A)  denote  the  set  of  fields  from  field  ^A  in  the  initial  ordering  and  until 
the  last  field  #A/. 

The  maximum  tail  improvement  of  A.  denoted  ImprTail(A).  is  the  maximum 
improvement  in  the  average  query  cost,  computed  as  the  difference  between  the  cost  of 
a  sequential  scan  and  the  average  cost  of  query  using  indices  on  all  fields  in  Tail! A). 
ImprTaiK/A)  is  an  upper  bound  on  the  improvement  in  average  query  cost  from  adding 
indices  on  Tail(/4)  to  any  index  set  not  containing  them. 

From  the  definition  of  extension  we  have  that  for  any  index  set  +  which  is  an 
extension  of  <J>. 

all  fields  in  +  -  *  are  in  Tail(LAST(*)*l) 
Therefore 

COSTiany  extension  of  *)  >  (6.3) 

COSK*)  -  ImprTaiK  LAST(*)'l) 

To  make  effective  use  of  ImprTail.  we  precompute  ImprTail(/4)  for  all  A  before  we 
begin  the  search  for  the  optimal  index  seL  The  compuution  is  done  by  computing  the 
average  query  cost  with  no  indices,  then  with  just  the  last  field  indexed,  then  with  just 
the  last  two  fields  indexed,  etc..  finally  with  all  fields  indexed.  Since  we  add  only  one 
index  at  a  time,  the  computation  lakes  a  linear  time. 

6.3.3  The  cutoff  theorem 

By  combining  the  above  equations  we  get  a  stronger  lower  bound  on  the  cost  of  any 
extension  of  <l>. 

Theorem  6.1:     Let  4>  be  an  index  set.  Let  Z.  =  LAST(<t>).     Then 
COSTl*  *  any  K-\^\   fields  from    (/.-I N]   )  > 

COST(*)  -  min  [ImprTaiia*!). 

MaxImpr(Z.*l)  +  ...  +  MaxImpr(Z.+/C-i  4>| )  ] 

Proof:  equations  (6.2)  and  (6.3). 

Using  this  theorem,  we  can  formulate  our  cutoff  criterion.  We  note  thai  BestCost  is 
initialized  as  the  average  query  cost  when  no  fields  are  indexed.  At  any  point  in  the 
search,  let  *  be  the  current  index  set.  L  =  LAST(<t>)  and  BestCost  be  the  cost  of  the 
best  index  set  found  so  far.     Then 
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The  Cutoff  Criterion 


if  C0ST(4>)  -  min[ImprTail(Z.+  l). 

MaxIinpr(Z.+l)  +  ...  +  MaxImpr(Z.-^/C-|*|)] 
t  BestCost 

then  stop  extending  4>,  since  this  will  not  produce 
an  index  set  with  cost  less  than  BestCost. 


6.4  THE  LOWER  BOUND  ON  THE  GREEDY  ALGORITHM 

The  Greedy  Algorithm  with  a  good  initial  ordering  of  fields  and  an  admissible  cutoff 
exhibits  an  excellent  performance.  In  this  section  we  present  an  additional  argument  in 
favor  of  the  Greedy  algorithm,  by  showing  that  for  a  certain  simplified  version  of  our 
OIS  problem,  the  Greedy  algorithm  is  guaranteed  to  find  a  solution  which  is  at  least 
63%  of  the  optimal. 

The  simplified  version  of  OIS  is  obtained  by  assuming  that 

1.  All  queries  are  of  the  form 

A  =const    Sl  ...  &.  A  =const 

2.  All    fields    have    the    same    low    density    (see    ch.    2),    and    as    a    result    all 
comparisons  A  =  const  are  estimated  to  have  the  same  selectivity. 

3.  That  selectivity  is  so  low   that  for  each  query  an  optimal  EM   uses  just  one 
index  scan  on  any  of  the  comparisons  A   =  const    m  query. 

Then  we  can  map  the  set  of  all  queries  to  the  universal  set  S.  Each  query  maps  to  a 
point  in  S.  A  query  with  a  field  #/  maps  to  a  point  in  the  subset  of  S  called  S .  A 
query  on  several  fields  corresponds  to  a  point  which  belongs  to  several  subsets. 

This  mapping  transforms  OIS  to  the  Max/mum  Cover  Problem.    [Piatetsky  83]: 

Maximum  Cover  Problem: 
Given  N  sets  S     ,  S S  .  select  K  sets  such   that   their  union   has  the 

1  Z  N 

maximum  size  (i.e.,  gives  the  maximum  cover). 
This  problem,  abbreviated  MCP.  is  NP-hard. 
The  Greedy  algorithm  applied  to  MCP  can  be  described  as: 
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1.  Surt  with  an  empty  set  collection. 

2.  Add  a  set  which  would  give  a  maximum  improvement  over  the 

current  collection. 

3.  If  the  collection  has  less  than  K  sets,  go  to  step  2: 
Otherwise  stop. 

Figure  6-2:      Greedy  .\lgoriihm  applied  to  MCP 


Let  IS  I  denote  the  size  of  a  set  S  .  When  we  speak  of  the  size  of  a  collection  of 
sets,  we  refer  to  the  size  of  the  union  of  those  sets.  The  "good"  behaviour  of  the 
Greedy  algorithm  follows  from  the  following  theorem  which  gives  the  exact  lower 
bound  on  the  result  of  the  Greedy  algorithm. 

Theorem  6.2:  Let  the  Greedy  Cover  be  the  collection  of  K  sets  found 
by  the  Greedy  algorithm  (see  figure  6-2)  for  MCP.  Let  the  Maximum 
Cover  be  the  maximal  collection  of  K  sets,  chosen  from  among  N  sets. 
Then  the  smallest  possible  ratio  \CreedY  Cover'  I '  Maximum  Cover\ 
(achieved  in  the  worst  case  for  the  Greedy  algorithm)  is  given  by 

1  Greedy  Cover  I 

'■ —     =  1  -  (1  -  l/Kf 


I  Maximum  Cover  | 

The  proof,  done  by  constructing  the  worst  case  example,  is  given  in  the  appendix. 

Corrolary:  Since  (l-l/ZC)'^  asymptotically  increases  to  1/e.  where  e  =  2.71828...  .  and 
l-d-l/ZC)"^  asymptotically  decreases  to  1-1/e,  we  can  infer  that  for  any  MCP  problem. 

I  Greedy  Cover]  1 

'. >  1  -    -    =     0.632... 

I  Maximum  Cover  |  e 

This  result  gives  us  a  measure  of  how  good  are  the  solutions  found  by  the  Greedy 
algorithm  for  a  simple  abstraction  of  OIS.  For  the  general  OIS  it  means  thai  a 
solution  found  by  the  Greedy  algorithm  will  have  an  index  scan  on  at  least  63*^0  of  all 
resolvable  queries.     In  practice,  this  figure  is  much  higher  and  is  close  to  100%. 

The  previously  described  Greedy  algorithm  can  be  called  an  "uphill"  Greedy  -  it 
starts  with  no  sets  and  adds  them  one  at  a  time.  .K  variation  on  that  approach  is  the 
"downhill"  Greedy:  start  with  all  sets  and  discard  one  set  at  a  time.  The  worst  cases 
of  both  algorithms  are  different.  For  example,  the  "downhill"  Greedy  will  find  the 
optimal  solution  in  the  previously  described  worst-case  of  the  "uphill"  Greedv.  We 
plan  to  investigate  the  worst-case  behaviour  of  the  "downhill"  Greedy. 
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We  conjecture  thai  the  combination  of  both  "uphill"  and  "downhill"  Greedy  algorithms 

will  have  a  higher  lower  bound  on  the  Greedy  Cover  than  any  one  of  these  algorithms 
alone. 
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7.  SUMMARY  AND  DIRECTIONS  FOR  FUTURE 
RESEARCH 

In  this  dissertation  we  presented  a  theory  for  the  Self-Organizmg  Database  System, 
described  a  partial  implementation  of  that  theory  in  the  FASTSCAN  query  system,  and 
presented  new  theoretical  results  m  the  areas  of  smgle-file  query  optimization, 
estimation  of  the  number  of  records  satisfying  a  condition,  admissible  recording  of 
query  frequencies,  and  lower  bounds  for  approximations  computed  by  the  Greedy 
algorithm. 

In  chapter  2  we  analyzed  estimation  of  the  number  of  records  satisfying  a  condition. 
We  presented  a  new  method  for  accurate  estimation  of  that  number  for  conditions  of 
the  type  field  rel-op  const.  We  also  suted  axioms  on  an  estimation  scheme  and 
derived  a  scheme,  satisfying  the  axioms,  with  the  smallest  possible  worst-case  estimation 
error. 

In  chapter  3  we  introduced  our  data  and  cost  models.  We  described  the  elementary 
operations  on  data  and  used  them  to  construct^  the  elementary  operations  on  streams  of 
pointers  and  records.  We  defined  a  query  evaluation  method  (EM)  as  a  sequence  of 
stream  operations  on  some  of  the  query  tree  nodes.  In  this  way  we  formalized  the 
space  of  possible  EMs  for  each  query. 

In  chapter  4  we  showed  how  to  select  an  optimal  EM.  We  gave  an  algorithm  for 
finding  an  optimal  EM  for  queries  in  conjunctive  normal  form  (CNF).  Then  we 
demonstrated  how  to  convert  every  query  to  CNF  preserving  or  reducing  the  cost  of  an 
optimal  EM.  Therefore,  to  get  an  optimal  EM  we  first  convert  a  query  to  CNF  and 
then  find  an  optimal  EM  on  the  CNF  query. 

In  chapter  5  w'e  used  the  formula  for  the  cost  of  an  optimal  EM  to  analyze  different 
ways  of  storing  a  set  of  queries.  We  presented  an  admissible  f-scheme.  which 
allows  a  correct  computation  of  the  average  query  cost.  We  also  described  two 
simplifications  of  that  scheme,  which  are  not  admissible  in  general  but  allow  us  to 
compute  a  good  approximation  of  the  average  query  cost  for  most  practical  cases. 

In  chapter  6  we  described  two  algorithms  for  the  optimal  index  selection  (OlS) 
problem.  One  is  a  backtracking  "Branch  &.  Bound"  (B&B)  algorithm  which  is 
guaranteed  to  find  the  optimal  solution  and  another  is  the  heuristic  nonbacktracking 
search    algorithm,    a    variation    of    the    Greedy    method.        We    explained    the    initial 
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ordering  of  the  indices  and  the  powerful  cutoff  criterion  which  together  make  both 
algorithms  very  fast  Even  though  OIS  is  NP-hard.  the  B&B  algorithm  finds  a  solution 
for  practical  cases  with  30  fields  in  under  a  mmute. 

The  complexity  of  Greedy  algorithm  is  0(/V')  and  even  though  it  is  not  guaranteed  to 
find  the  optimal  solution,  it  does  so  in  almost  all  practical  cases.  One  reason  for  its 
good  performance  on  OIS  may  be  that  the  Greedy  algorithm  is  guaranteed  to  find  a 
good  solution  for  a  simplification  of  the  OIS,  the  Maximum  Cover  Problem.  We  found 
the  exact  lower  bound  for 

I  Greedy  Cover  | 

=  1  -  {l-\/Kt 

I  Maximum  Cover  | 

and  remarked  that  it  is  greater  than  1  -  1/e  -  0.632...  . 

7.1  DIRECTIONS  FOR  FUTURE  RESEARCH 

There  are  many  areas  for  further  research.  First,  there  are  refinements  of  issues 
raised  in  this  dissertation,  such  as: 

-  Incorporation  of  Difference  of  pointer  lists  into  selection  of  optimal  EM. 

-  Incorporation  of  Sort  into  selection  of  an  optimal  EM. 

-  Development   of   a   two-field   distribution   table   for   estimating  selectivities  of 
conjunctions  and  disjunctions  of  predicates. 

-  Analysis  of   performance   of    the   "downhill"   Greedy   algorithm,   alone  and   in 
tandem  with  the  "uphill"  Greedy  algorithm. 

Second,  there  are  issues  left  out  of  this  dissertation.     The  most  important  ones  are: 

-  multifile  queries 

-  reorganization  process 

-  reorganization  points 

-  on-line  update 

-  vertical  partitioning 

Multifile  queries:  We  plan  to  explore  the  approach  to  multifile  query  optimization 
based  on  propagating  streams  of  records  and  pointers  through  a  query  iree.  In 
addition,  we  will  investigate  f-schemes  for  recording  multifile  query  frequencies. 
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On-line  update:  another  way  to  expand  the  scope  of  this  work  is  to  consider,  besides 
queries,  also  updates,  deletions  amd  insertions.  We  need  to  expand  basic  data  and 
stream  operations  and  extend  the  theory  of  optimal  query  evaluation  to  the  other 
transactions.  We  also  need  to  investigate  admissible  f-schemes  and  modify  the 
algorithms  for  optimal  index  selection. 

Reorganization  process:  the  reorganization  process  requires  adding  and  deleting  some 
indices.  Some  application  programs  and  access  modules  can  become  invalid  or 
inefficient  as  a  result.  F.^STSCAN  deals  with  this  problem  in  a  following  way.  It 
generate  programs  for  repeatedly  used  queries.  Whenever  such  a  program  is  run.  the 
monitor  (which  in  honor  of  1984  we  called  the  "Big  Brother")  checks  if  the  generated 
code  is  valid.  If  it  is  not.  new  code  for  this  query  is  automatically  generated  and  run. 
A  similar  approach  is  used  in  System  R  [Chamberlin  81]  for  regeneration  of  access 
modules  which  use  indices  that  were  deleted. 

Reorganization  points:  the  simplest  approach  is  to  reorganize  at  some  regular 
intervals.  A  better  idea  is  to  reorganize  only  when  the  improvement  m  performance 
over  some  future  interval  outweighs  the  cost  of  reorganization,  including  the 
recompilation  of  application  programs.  A  sophisticated  theory  for  choosing  file 
reorganization  points,  developed  in  [Batory  82a] ,  can  be  applied  here.  Another  issue  is 
taking  into  account  previous  reorganizations.  Chan  [Chan  76]  uses  exponential 
smoothing  to  forecast  trends  in  usage  of  each  index.  Other  statistical  techniques  may 
be  applicable. 

Attribute  partitioning:  this  is  an  active  area  of  research  [March  83] .  Hammer  and 
Niamir  [Hammer  79]  have  designed  a  prototype  SODS  which  reorganizes  its  attribute 
partitioning.  A  very  interesting  research  area  is  the  design  of  an  SODS  which  combines 
the  optimal  index  selection  and  attribute  partitioning. 


In  summary,  we  believe  that  we  have  established  a  theoretically  interesting  and 
practically  useful  framework  for  further  development  of  Self-Organizing  Database 
Svstems. 
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I.  PROOF  OF  THE  LOWER  BOUND  OF  GREEDY 

ALGORITHM  FOR  THE  MAXIMUM  COVER  PROBLEM 

We  restate  the  Greedy  Lower  Bound  Theorem  6.2.  from  chapter  6. 

Theorem  LI:  Let  the  Greedy  CCA/er  be  the  set  collection  found  by  the 
Greedy  algorithm  (see  figure  6-2))  for  MCP.  Let  the  Maximum  Cover  be 
the  maximal  collection  of  K  sets,  chosen  from  among  N  sets.  Then  the 
lower  bound  of  \Greedy  Cover \   I   \ Maximum  Cover \  is  : 

I  Greedy  Cover  | 

=  1  -  (1  -  l/Kt 


I  Maximum  Cover  I 


This    lower    bound    can    be    approached    arbitrarily    closely    in    MCP    instances 
representmg  the  worst  case  for  the  Greedy  algorithm. 

Notation:  An  instance  of  MCP  is  defined  by 

MCP  =  (N,  K,   {S    S     }    ) 

I  N 

where    {S     S  }    are  the  N   sets,   and   K   is   the  number  of   sets  that   the  Greed v 

1  N 

algorithm   is   allowed   to   choose.      Let   MC   =    {M     M  J    denote   the   Maximum 

cover.     Let  GC  =   {G^ Q^   denote  the  Greedy  Cover.     Let  GIM  ratio  be 

G/M  ratio  =   |GC|    /    |MC| 

Proof  of  the  Theorem:  The  proof  consists  of   two  parts.      In   the  first  part  we  will 
consider  a  class  of  MCP  instances  where  the  MC  and  the  GC  are  such  that  : 

L  Every  set  is  either  in  MC  or  in  GC 

2.  There  are  no  sets  which  are  both  in  GC  and  MC 

3.  GC  sets  do  not  intersect  between  themselves 

4.  MC  sets  do  not  intersect  between  themselves 

5.  GC  cover  is  totally  contained  within  MC  cover 

6.  MC   sets  are   equal   in   size,   and   every   GC  set   is   divided   by   MC   sets   into 
equal  parts. 

We  will  construct,  for  this  class  of  instances,   the  the  worst-case  example  of  MCP 
where  G/M  ratio  approaches  the  lower  bound  arbitrarily  closely. 

Then,  in  the  second  part  of  our  proof,  we  will  show  for  each  of  these  6  properues  : 
if  an  instance  of  .VICP  does  not  satisfy  the  property,  ihen  we  can  transform 
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il  10  another  instance  of  MCP.  satisfying  the  properly  and  with  the  same  or 
lower  G/M  ratio  . 


Therefore,  for  any  instance  of  MCP.  we  can  construct  another  instance,  satisfying  all 
6  properties,  and  having  the  same  or  lower  G/M  ratio.  Since  the  first  pan  of  our 
proof  shows  that  for  such  MCP  instances  G/M  ratio  >  1  -  (1-1 /K)*^,  we  conclude  that 
for  any  MCP  instance  G/M  ratio  >  1  -  (1-1/K)''  . 

I.l  CONSTRUCTION  OF  THE  WORST-CASE  EXAMPLE 

Here  we  examine  a  class  of  MCP  instances  satisfying  all  6  properties.  From  the  first 
two  we  can  conclude  that  K  =  N/2. 


Without  loss  of  generality  we  can  assume  that 

I   Maximum  Cover   |    =  1 

If  it  is  not  so.  we  can  suitably  normalize  all  the  sets  in  the  MCP.     Thus  G/M  ratio  = 

I  Greedy   Cover  1.      Let   the  maximum  cover  consist  of   K  disjoint  "vertical"  slices,   M,. 

M, M    .  all  equal  in  size  to  1/K.     We  will  construct  a  set   {G  }  that  will 

:  L  ^  ,   1-1.  K 

be  selected  by   the  Greedy  algorithm.     This  set  will  consist  of   K  disjoint  "horizontal" 
slices,  decreasing  in  size.  (See  figure  I-l). 


M 


M 


Figure  I-l:      Worst  case  MCP  instance  for  Greedy  .algorithm 


G  .  to  be  selected  by  the  Greedy  algorithm  at  the  first  step,  should  be  the  largest  of 
N  sets.     Let  it  be  just  slightly  larger  than  M  ,  i.e.. 
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|G  I   =  1/K  +  e,  where  e  is  infinitesimal 

Let  G,  be  just  slightly  larger  (by  the  same   e)  than  the  part  of  M    not  covered  by  G  . 
That  pan  has  size  1/K  -  |G  |/K.    So 

IG  I   =  1/K  -   IG  l/K  +   f  =   IG  I   -   IG  l/K  =   IG  I   (1  -  1/K) 

Analogously,  G    is  just  slightly  larger  (by   the  same   e)  than  part  of  M     not  covered 
by  either  G    or  G,. 

|GJ    =  1/K  -    |GJ/K  -    |GJ/K  +    e  =    |Gj    -    |Gj/K  = 
=   |GJ   (1  -  1/K)  =   IGJ   (1  -  1/K)' 

G    G    will  be  build  similarly.    Size  of  G    will  be 

4  K  K  , 


|GJ   =    |GJ   (1  -  1/K) 


K-l 


The  cover  selected  by  the  Greedy  algorithm  will  have  the  size 

COVER  =   |GJ   +   IGJ   +  ...  +   |GJ 
Since  (as  easily  shown  by  induction) 


Then 


|G  I   =   |G  I   (1-1/K)' 

1,1  '         1   ' 


COVER=|G  I   +   |G  I  (1-1/K)  *   IG  I  (1-1/K)-  *  ...  +    |G  |  (1-1/K)''"')= 
1,11,1  '     I '  '     I ' 


1  -  (l-l/K)"  1  -  (1-1/K)'' 


=   |GJ =     (1/K  +   c) 

1  -  (1-1/K)  1  -  (1-1/K) 


=  (1  +   eK)  (1  -  (1-1/K)^  ) 
Since    e    can    be   made   arbitrarily   small.    COVER   and    hence   G/M    ratio   can    be   made 
arbitrarily  close  to  1  -  (l-l/K)*^  -  1  -  1/e  ^  0.632  . 

1.2  PROPERTY  1   -  EVERY  SET  IS  EITHER  IN  MC  OR  IN  GC 

Proof:  It  is  sufficient  to  prove  that  if  there  is  one  set  A  not  in  MC.  nor  in  GC. 
then  we  can  eliminate  it  and  the  resulting  instance  of  MCP  will  have  the  same  GC  and 
MC  . 

Clearly,   MC  will  be  the  same.     CC  will  also  be  the  same,  since  ai  each  step.  Greedy 
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algohihm  considers  inierseciion  of  any  set  onJy  with  sets  previously  selected.  Therefore 
if  A  was  not  selected  in  the  original  instance  of  MCP.  then  its  intersections  with  any 
other  sets  do  not  affect  the  selection  of  'Greedy"  sets  in  the  new  insunce. 

1.3     PROPERTY  2   -  THERE  .ARE  NO  SETS  WHICH  .ARE 
BOTH  IN  GC  AND  IN  MC 

Proof:  Suppose  that  there  is  an  instance  of  MCP 

MCP,=(  N.  K.    {S^ SJ) 

which  has  a  Greedy  cover  GC  ={G G  1   and  a  Maximum  cover  MC,  =    (M^ 

M  }    such  that  there  are  L  sets  in  common  between  GC    and  MC  .     Let  A  be  one  of 

them    A=G    in  GC    and  A=M    in  MC  .     Let  us  "duplicate"  this  set  and  call   the  copy 

1  1  J  1 


Original  MCP 


■  ■  «♦ 

A'  I 


Modified  MCP 


Figure  1-2:       Transformation  for  Property  2 
Let  us  consider  a  new  instance  of  MCP. 

MCP,  =  (  N*l.  K.    {S|  A.A" S^  }    ) 

We  will  show  that  it  has  a  Greedy  cover  and  a  Maximum  cover  with  one  less  sets  in 
common  but  with  the  same  G/M  ratio. 


Greedy  algorithm  will  select  the  same  sets  G,.  G,.  C_^.  until  it  comes  to  step  #i.  In 
the  original  instance  MCP  set  A=G  was  selected.  Now.  in  MCP,.  the  choice  is 
between  A  and  A'.  Since  they  are  exactly  equal  in  all  respects,  choice  can  be  made 
only  at  random.  Suppose  A'  was  selected.  Then  A  would  not  be  selected  by  the 
Greedy  algorithm  at  any  further  step,  since  improvement  from  adding  A  to  a  set 
containing  its  exact  copy  A"  is  zero.     Therefore  we  will  have  a  Greedy  cover 

GC-={G. G  ,.A',G 


'V 


and 


iGc;!  =  |GC. 


Since  A=A".  there  are  two  equal-sized  maximum  covers.  MC.  (which  is  equal  to  MC^). 
and  MC,'  which  is  obtained  from  MC^  by  replacing  A  with  A".  W.l.g.  we  can  choose 
MC,  .  MC,  and  GC^'  have  one  less  set  in  common  than  MC,  and  GC,.  but  have  the 
same  G/M  ratio. 


Appendix  I 


-  91  - 


Proof  of  GreedVs  Lower  Bound 


By  repeating  this  transformation,  we  will  finally  obtain  an  instance  of  MCP  in  which 
GC  and  MC  have  no  sets  in  common.  Q.E.D. 

Hence  it  is  enough  to  find  the  worst  case  when  N=2K,  for  if  K  >  N/2  ,  the  worst 
G/M  ratio  will  be  higher. 

1.4  PROPERTY  3  -  GC  SETS  DO  NOT  INTERSECT  BETWEEN 
THEMSELVES 

Suppose   that  there  are  such  sets.      Let  G    be   the  first  GC  set   that  has  intersection 

with  some  previous  GC  set  G  .      Then  we  can  construct  a   new   instance  of   MCP   bv 

r 

replacing  G    with  G '  =  (G   -  G  )  (see  next  figure) 

I  lip 


^SSSSSXXSSSSSSSSSXSSXSXSH 


I  *---*    Pi 


Original    MCP 


+-+ * — 

I  G 


h  S-5  33S3S3CX3Z33333ZS533: 

Mod  i  f  i  ed  MCP 


Figure  1-3:       Transformation  for  Property  3 

Since  G    is  not  in  MC  by  property  2.  and  it  is  reduced,  MC  is  not  changed.     Greedy 
cover  will  be 


GC-={G G    .  G'  ,  G   ,. 

1  1-1       1  i*i 


Gk> 


Since  the  improvements  from  G ',  G     G     over  the  union  of  all  precedmg  sets 

11*1  K 

are   the  same  as  improvements  of   G.   G^^^ G^.   we   infer   thai    |  GC"  i    =    |GC|. 

Hence  G/M  ratio  remains  the  same  as  in  the  original  instance. 


By  repeating  this  transformation,  we  will  eventually  get  an  instance  of  MCP  with  the 
same  G/M  ratio,  where  no  Greedy  set  infersects  with  any  preceding  Greedy  set,  and 
hence  no  two  Greedy  sets  intersect  . 
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1.5  PROPERTY  4  -  MC  SETS  DO  NOT  INTERSECT 
BETWEEN  THEMSELVES 

The  proof  here  is  esseniially  ihe  same  as  for  property  3.     Suppose  that  there  are  such 
sets.     Let  M    be  the  first  MC  set  ihai  has  intersection  with  some  previous  MC  set  M  . 
Then  we  can  construct  a  new  instance  of  MCP  by  replacing  M    with  M  "  =  M    -  M 
(see  next  figure) 


^SBScssasi 


*-!-:.!-. 


'M'     ' 


%! 


^■■ssBaxssi 


[  X  =  3  s  =  =  a  1 


Original    MCP 


►  SBBXSCSSXS  =  3  =  S3SS  =  =  =  : 

Mod  I f 1 ed  MCP 


Figure  1-4:      Transformation  for  Property  4 


Lei  MC=    (M M    .  M  •.  M 


M^}.     Clearly.    |MCl  =  iMCi.  therefore  MC 


is  a  Maximum  coNer  for  the  Modified  MCP. 

Observation  -  if  a  set  was  not  selected  by  Greedy,  and  it  is  reduced,  then  it  will 
not  be  selected. 

1.6  PROPERTY  5  -  GC  COVER  IS  TOTALLY  CONT.AINED 
WITHIN  MC  COVER 

Notation:     A  -  B  denotes  part  of   set  A  not  in  B.   /4  &  5  denotes  the  intersection 
between  sets  A  and  E 


Proof:  Let  G    be  a  GC  set  that  is  not  completely  inside  MC,  so  that    |G    -  MC!    > 
0.     Let  us  note  that   IG    -  MCl    <    IMC-GCI,  since 

IMCI    =      IMC  <t  GCl   -   IMC  -  GCl 


IGCI    >    iMC  &  GCi   *   iG    -  MCI 


and 


|MC|    >    |GC| 
W.l.g.    we   can   assume   that   our   instance   of   MCP  satisfies   the   first  4  properties.      We 
create  a  modified  instance  of  MCP  bv  replacing  G    with  G  '.  where  G  '  is  obtained  bv 

r  r  r 

eliminating  part  of  C^  not  in  Maximum  cover  and  distributing  it  among  those  M    sets 
that  have  a  "GC-mdependenf   pan  (i.e.   part   that  is  not  covered   by  any  GC  set).     For 
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each  such  M ,  this  GC-independent  pan  is  reduced  and  by  exactly  the  same  amount  the 
intersection  between  M  and  G  '  is  increased  (See  next  figure).  We  do  not  specify 
exactly  how   to  perform  this  redistribution,  however  it  is  possible  to  do  since   the  sum 


of  GC-independent  parts  has  size   |  MC  -  GCl    >    \C 
G    we  are  redistributing. 


MC  I .  the  size  of  the  part  of 


^t/00lf00000000M000 


000000000000000000 


0 

» 

0 
0 
0 
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Original    MCP 


♦ +  +--+  0 

I  i  I  I  il  * 
1  '  I  I  •'l  * 
1      III" 

'         G  0  * 

I  I 


4. jf     +•--  + 

IM'  I  'M'l 


0 

0 

'  i  '  '  <  '  0 

^  \      II  I  " 
]  •♦• — ■♦•  +--+ 

G' 

♦ +^ 

0 

00n000000000000n<'it 
Modified  MCP 


I 


1 


M' ; 

i 

r 

IM    1 

i 

M'  I 

J 

s 

;m  i 

j 
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s 

!G    I 

Figure  1-5:       Transformation  for  Property  5 

GC  cover  in  the  modified  instance  will  remain  the  same,  since  the  improvements  for 
each  of  QQ.  sets  are  the  same  as  in  the  original  instance,  and  improvements  for  each  of 
MC  sets  are  the  same  or  less. 

We  observe  that  each  MC  and  GC  set  in  the  modified  instance  has  the  same  size  as 
the  corresponding  set  in  the  original  instance.  Some  MC  and  GC  sets,  however,  will 
have  bigger  intersections  with  other  sets.  So  each  set  collection  in  the  modified 
instance  will  have  the  same  or  smaller  size  than  the  corresponding  set  collection  in  the 
original  instance.  In  particular,  no  set  collection  in  the  modified  instance  will  be  larger 
than  the  Maximum  Cover  of  the  original  instance.  Hence  the  Maximum  cover  of  the 
original  instance  will  also  be  the  maximum  cover  of  the  modified  instance.  Therefore. 
G/M  ratio  will  remain  the  same  after  one  transformation.  By  applying  this 
transformation  repeatedly  to  GC  sets  that  are  not  inside  MC.  we  will  eventually  get  an 
instance  of  MCP  where  the  Greedy  cover  is  completely  inside  the  Maximum  cover,  and 
with  the  same  G/M  ratio  as  in  the  original  instance.  Q.E.D. 


1.7  PROPERTY  6  -  MC  SETS  ARE  EQUAL  IN  SIZE.  AND 
EVERY  GC  SET  IS  DIVIDED  BY  MC  SETS  INTO  EQUAL 
PARTS 

Proof:     Suppose  we  have  an  instance  of  MCP  without  property  6: 
MCP    =  (N.  K..    IG G.  .  M    M   ! ) 


G^.  M, 


with   QQ^{Q 


and   MC=(M 


M.}.      Without   loss  of  generalit}    we  can 
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assume     thai     it     has     properties     1-5.         Let     us     create     a     new     instance     MCP  '= 
(N.K.  IG  ■ Gj^'.M  ■ M  ■] )  where  Greedy  cover  sets  remain  the  same.  i.e. 

IG'I    =    |G  I    for  1=1.  K 

MC  sets  will  be  constructed  by  a  two  step  process. 

First    we   divide    each    G "    set    into    K    equal    subsets    and    let    each    subset    represent    an 

intersection  between  G '  and  M  "  and  for  some  j. 

|M-  &  g;|  =  |g;|  /  K 

Then,    we  divide   MC  -  GC   in    the  original   instance   into   K   equal   parts  and   add  each 
part  to  M  '  .     At  the  end  all  M  '  are  equal  and  each  has  size   iMC|/K  . 

|M-|   =  JMC|/K 


M 


M 


M' 


M'         M' 


1  2  3 

I 

1 

"  j     !  "" 

I     I 


1      +■ 

•   Original    instance 


G" 
3 

G' 
2 

G' 


1 

2 

3 

Modified  instance 


Figure  1-6:       Example  of  transformation  for  Property  6  with  K=3 
The  set  MC  =    {M  ' M  "}    covers  MC  and   therefore   MC   is  a  maximum  cover 

1  K 

for  the  modified  instance.     Moreover,    |  MC  |    =    |MC|.     Thus  it  remains  to  prove  that 
GC'=  {G  ' G  ■}   is  a  Greedy  cover. 

Definition    1.2:      The   improvement    from    adding   a   A    on    step   L   set    to   a 
partial  Greedy  cover   (Gl ,  denoted  imprfA.  L.    {C) )  is 

impr(A,    L,     {G   } )    -    |a   -      U        G    | 
j  i-l.L      i 


First,  since  sizes  of  G  '  and  intersections  amongst  them  are  the  same  as  the  sizes  and 
intersections  of  G .  we  have 

for  all  1,  L.  j  (1.1) 

impr(G-.  L.    {G1)  =  impr(G.  L.    (G  }) 
Second,  at  each  step  L  in  the  original  instance  we  have 
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max(  impr(M.  L.   {G  } )  )  <  impr(G,    .  L,    (0  } )  (1.2) 

i=l.K 

Lei  R  =  impr(M .  L.  {G } )  and  R '  =  impr(M '.  L.  (G ') ).  Then  sum  of 
improvements  is  the  pari  of  Maximum  cover  not  in  the  first  L-1  sets  of  Greedy  cover 
.  or 

Rl    +   R2   +    . . .    +   RK  =    |MC   -   U      Gj     I 

J-l.L-1 

Rl'    +   R2'    +    ...    +   RK'    =    |MC'    -   U   Gj '     |    =    |MC   -   U   Gj     | 

j=l.L-l  j'l.L-1 

Moreover,  all  improvements  in  the  modified  instance  are  equal,  thus 

R  '=R  •=...=R  • 

Therefore  (since  sum  of  K  numbers  divided  by  K  is  <  maximum  of  these  numbers)  . 

max  (R  *)  <  max  (R ) 
i=l.K    '         i=l.K 

Taken  together  with  equations  (I.l)  and  (1.2)  we  get 

max(  impr(M'.  L.    [G'})  )  <  impr(G, '.  L.    (G  •} )  (1.3) 

'  J  ~  L  J 

i=l.K 

which  means  that  in  the  modified  instance,  at  step  L  of  the  Greedy  algorithm, 
improvement  from  G  '  is  bigger  than  the  improvements  from  any  set  in  the  maximum 

cover.      Since  improvement   from  G  '   is  also   bigger   than   improvements  from  G  _  " 

G  '  we  infer  thai  G     will  be  selected  by  the  Greedy  algorithm.     Therefore  the  Greedy 

cover   for   the   modified   instance   will   be   GC'={G|" C^'}.      Since    |GC|  =  |GC|    and 

|MC=MC|,  G/M  ratio  will  be  preserved. 

1.8  CONCLUSION  OF  THE  PROOF 

We  have  showed  for  each  of  the  6  properties,  thai  if  an  instance  of  MCP  does  not 
satisfy  the  property,  then  we  can  transform  it  to  another  instance  of  MCP.  satisfying 
the  property  and  having  the  same  or  smaller  G/M  ratio.  We  have  also  showed  that 
among  all  MCP'  instances  satisfying  all  6  properties,  the  MCP  instance  from  section  I.l 
has  the  lowest  possible  G/M  ratio  =  1  -  (1  -  l/A")*^.  Therefore,  all  MCP  instances 
have  G/M  ratio  higher  than  or  equal  to  1  -  (1  -  l/K)^.  and  this  is  the  exact  lower 
bound  for  G/M  ratio.     Q.E.D. 
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II.  NOTATION  AND  ABBREVIATIONS 


Single  Letter  Notation 


A.  B. 

c. 

D 

attributes  (i.e..  fields) 

E.  F. 

c 

Boolean  expressions  (defined  in  sec.  1.6 

r 

a  single  record,  with  all  its  fields 

R 

a  stream  of  records 

P 

a  single  pointer 

P 

a  stream  of  pointers 

Q 

a  query 

S 

the  number  of  distribution  steps  (2.2.2) 

T 

a  term 

X.  Y 

constants 

*.* 

index  sets 

e 

a  querytype 

Costs  and  Sizes 

the  number  of  records  in  the  file 

the  blockmg  factor,  i.e..  the  number  of  records  per  page 

the  number  of  pages  occupied  by  the  file 

the  size  of  list  or  stream  L 


C  OF 

the  cost  of 

C  CF 

the  cost  of 

C  GR 

the  cost  of 

C  OKA) 

the  cost  of 

C  C\(A) 

the  cost  of 

C  Q?(A) 

the  cost  of 

C  Q^A) 

the  cost  of 

OPEN_FILE  operation  (sec.  3.4) 
CLOSE_FILE  operation  (sec.  3.4) 
GET_RECORD  operation  (sec.  3.4) 
0PEN_INDEX(>4)  operation  (sec.  3.4) 
CL0SE_INDEX(-4)  operation  (sec.  3.4) 
GET_FIRST(/4)  operation  (sec.  3.4) 
GET_NEXT(/4)  operation  (sec.  3.4) 
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ABBREVIATIONS  and  DEFINITIONS 

admissible  f-scheme  one  that  allows  correct  computation  of  the 

average  query  cost  (ch.  5) 

comp  a  comparison  (as  defined  in  sec.  1.6) 

Comp(A)  a  comparison  on  field  A  (as  defined  in  sec.  1.6) 

COND(node)  the  condition  of  a  query  tree  node  (sec.  3.6) 

CNF  Conjunctive  Normal  Form  (sec.  4.3) 

DA  a  data  access  operation  (sec.  3.5.2) 

Dau  Access  Selectivity  sec.  5.1 

DA_SEL(a<l>)  selectiviiies  of  DA  in  OPT_EM(Q.<t')  (sec.  5.1) 

density  the  field  density  (sec.  2.2) 

DF  a  data  filter  operation  (sec.  3.5.1) 

dominance  between  terms,  written  T  <<  T  (sec.  4.3.1) 

EM  a  query  evaluation  method  (sec.  3.7) 

EM-tree  a  query  tree  with  nodes  labelled  with  stream  operations  (sec.  3.7) 

err  error  code.  0  when  no  error. 

extension  of  *  sec.  6.3 

f-scheme  a  query  frequency  compression  scheme(ch.  5) 

G/M  ratio  |  Greedy  Cover  |    /    |  Maximum  Cover  |   (app.  I) 

GC  the  Greedy  Cover  -  cover  found  by  the  Greedy  method  (app.  I) 

ImprTail(/4)  maximum  tail  improvement  of  A.  sec.  6.3.2 

Index  Scan  Selectivity  sec.  5.1 

IND(Q,<I>)  indices  from  *  used  in  OFT_EM(Q.<i>)  (sec.  5.1) 

INT  a  stream  intersection  operation  (sec.  3.5.4) 

IS  an  index  scan  operation  (sec.  3.5.1) 

IS  SEUQ.*)  selectivuies  of  index  scans  in  0PT_EM(Q.4>)(sec.  5.1) 
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Notation 


list 


a  static  structure  with  all  elements  available  (sec.  3.1) 


logically  equivalent  Q'  can  be  obtained  from  Q  by  transformations 

of  Commuutivity.  Transitivity  and  Disiributivity  (ch.  4) 

Maxlmpr(/4)  maximum  improvement  from  an  index  on  A  (sec.  6.2) 

MC  the  Maximum  Cover  in  MCP  (app.  I) 

MCP  a  Maximum  Cover  Problem  (ch.  6) 

OIS  an  Optimal  Index  Selection  Problem  (ch.  6) 

optimal  EM  lowest  cost  EM  among  all  EMs  on  logically  equivalent  queries  (ch.  4) 

OPT_EM(Q,<i>)  optimal  (under  *)  EM  for  Q  (sec.  5.1) 

potentially  resolvable  an  expression  that  would  be  resolvable  if 

all  fields  were  indexed  (sec.  5.2) 

QT(Q)  a  queryiype  of  Q. 

query  cost  the  cost  of  OPT_E.M(query,<I>),  where  4>  is  the  index  set  (ch.  5) 

querytype  def.  5.7 

rel-op  a  relational  operator,  one  of  =.  >,  <.  <.  > 

Ttsinode)  the  pointer  or  record  stream  -  the  node  result  (sec.  3.6) 

resolvable  node  defined  in  sec.  3.6 

SODS  a  Self-Organizing  Database  System 

selectivity  of  f  a  fraction  of  records  satisfying  E  (ch.  2) 

SEL(£)  exact  selectivity  of  the  logical  expression  E  (ch.2) 

SEL-(£)  estimate  of  SEL(£)  (ch.  2) 

SS  a  sequential  scan  (sec.  3.5.6) 

stream  dynamic  process,  generating  one  element  at  a  time(sec.  3.1) 

subtype  a  querytype  of  a  subquery  (Def.  5.8) 

subquery  Def.  5.6.  sec.  5.1 

TOTCNT  total  count  of  queries  in  the  quer\   set 

tree(Q)  the  query  tree  of  Q  (ch.  4) 


Appendix  II  -  99  -  Notation 

UN  a  stream  union  operation(sec.  3.5.5) 

unresolvable  not  resolvable  (sec.  3.6) 
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